Preface


These  lecture  notes  were  written  for  the  course  ORF  570:  Probability  in  High 
Dimension  that  I  taught  at  Princeton  in  the  Spring  2014  semester.  The  aim 
was  to  introduce  in  as  cohesive  a  manner  as  I  could  manage  a  set  of  methods, 
many  of  which  have  their  origin  in  probability  in  Banach  spaces,  that  arise 
across  a  broad  range  of  contemporary  problems  in  different  areas. 

The  notes  are  necessarily  incomplete.  The  ambitious  syllabus  for  the  course 
was  laughably  beyond  the  scope  of  Princeton’s  12-week  semester.  As  a  result, 
there  are  regrettable  omissions,  as  well  as  many  fascinating  topics  that  I  would 
have  liked  to  but  could  not  cover  in  the  context  of  this  course.  These  include: 

a. Bernstein’s inequality does not appear anywhere in these notes (disgraceful!),
nor do any Bernstein-type concentration inequalities (such as concentration
of the exponential distribution and Talagrand’s concentration
inequality for empirical processes) and the notion of modified log-Sobolev
inequalities. These should be included at the end of Part I.

b. Chaining with adaptive truncation and entropy with brackets. Beyond being
a classical topic in empirical process theory, the power of the idea of
adaptive truncation has again proven its value in the recent solution of the
long-standing Bernoulli problem due to Bednorz and Latała.

c.  Universality  (prematurely  included  in  Chapter  1  as  a  topic  to  be  covered 
though  I  did  not  have  time  to  do  so)  and  an  introduction  to  Stein’s  method. 

d. Hypercontractivity and its applications, particularly to concentration inequalities
and to sharp thresholds (the latter should be promoted to a
fourth “general principle” in Chapter 1 in view of the ubiquity of phase
transition phenomena in high-dimensional problems).

e.  No  doubt  this  list  will  grow  even  longer  if  I  don’t  stop  typing. 

Hopefully  the  opportunity  will  arise  in  the  future  to  fill  in  some  of  these  gaps, 
in  which  case  I  will  post  an  updated  version  of  these  notes  on  my  website.  For 
now,  as  always,  these  notes  are  made  available  as-is. 




Please  note  that  these  are  lecture  notes,  not  a  monograph.  Many  important 
ideas  that  I  did  not  have  the  time  to  cover  are  included  as  problems  at  the 
end  of  each  section.  Doing  the  problems  is  the  best  way  to  learn  the  material. 
To  avoid  distraction  I  have  on  a  few  occasions  ignored  some  minor  technical 
issues  (such  as  measurability  issues  of  empirical  processes  or  domain  issues  of 
Markov  generators),  but  I  have  tried  to  give  the  reader  a  fair  warning  when 
this  is  the  case.  The  notes  at  the  end  of  each  chapter  do  not  claim  to  give  a 
comprehensive  historical  account,  but  rather  to  indicate  the  immediate  origin 
of  the  material  that  I  used  and  to  serve  as  a  starting  point  for  further  reading. 

Many  thanks  are  due  to  the  30  or  so  regular  participants  of  the  course. 
These  lecture  notes  are  loosely  based  on  notes  scribed  by  the  students  during 
the  lectures.  While  they  have  been  almost  entirely  rewritten,  the  scribe  notes 
served  as  a  crucial  motivation  to  keep  writing.  I  am  particularly  grateful  to 
Maria  Avdeeva,  Mark  Cerenzia,  Jacob  Funk,  Danny  Gitelman,  Max  Goer, 
Jiequn  Han,  Daniel  Jiang,  Mitchell  Johnston,  Haruko  Kato,  George  Kerchev, 
Dan Lacker, Che-Yu Liu, Yuan Liu, Huanran Lu, Junwei Lu, Tengyu Ma, Efe
Onaran,  Zhaonan  Qu,  Patrick  Rebeschini,  Max  Simchowitz,  Weichen  Wang, 
Igor  Zabukovec,  Tianqi  Zhao,  and  Ziwei  Zhu  for  serving  as  scribes. 


Princeton, 
June  2014 
Introduction


1.1  What  is  this  course  about? 

What is probability in high dimension? There is no good answer to this question.
High-dimensional probabilistic problems arise in numerous areas of science,
engineering, and mathematics. A (very incomplete) list might include:

•  Large  random  structures:  random  matrices,  random  graphs,  . . . 

•  Statistics  and  machine  learning:  estimation,  prediction  and  model  selection 
for  high-dimensional  data. 

•  Randomized  algorithms  in  computer  science. 

•  Random  codes  in  information  theory. 

•  Statistical  physics:  Gibbs  measures,  percolation,  spin  glasses,  . . . 

• Random combinatorial structures: longest increasing subsequence, spanning
trees, travelling salesman problem, . . .

• Probability in Banach spaces: probabilistic limit theorems for Banach-valued
random variables, empirical processes, local theory of Banach
spaces, geometric functional analysis, convex geometry.

•  Mixing  times  and  other  phenomena  in  high-dimensional  Markov  chains. 

At  first  sight,  these  different  topics  appear  to  have  limited  relation  to  one 
another.  Each  of  these  areas  is  a  field  in  its  own  right,  with  its  own  unique 
ideas,  mathematical  methods,  etc.  In  fact,  even  the  high-dimensional  nature  of 
the  problems  involved  can  be  quite  distinct:  in  some  of  these  problems,  “high 
dimension”  refers  to  the  presence  of  many  distinct  but  interacting  random 
variables; in others, the problems arise in high-dimensional spaces and probabilistic
methods enter the picture indirectly. It would be out of the question
to cover all of these topics in a single course.

Despite this wide array of quite distinct areas, there are some basic
probabilistic principles and techniques that arise repeatedly across a broad
range of high-dimensional problems. These ideas, some of which will be described
at a very informal level below, typically take the form of nonasymptotic
probabilistic inequalities. Here nonasymptotic means that we are not
concerned with limit theorems (as in many classical probabilistic results), but
rather with explicit estimates that are either dimension-free, or that capture
precisely the dependence of the problem on the relevant dimensional parameters.
There are at least two reasons for the importance of such methods.
First, in many high-dimensional problems there may be several different parameters
of interest; in asymptotic results one must take all these parameters
to the limit in a fixed relation to one another, while the nonasymptotic viewpoint
allows one to express the interrelation between the different parameters in
a much more precise way. More importantly, high-dimensional problems typically
involve interactions between a large number of degrees of freedom whose
aggregate contributions to the phenomenon of interest must be accounted for
in the mathematical analysis; the explicit nature of nonasymptotic estimates
makes them particularly well suited to be used as basic ingredients of the
analysis, even if the ultimate result of interest is asymptotic in nature.

The goal of this course is to develop a set of tools that are used repeatedly
in the investigation of high-dimensional random structures across different
fields. Our aim will not only be to build up this common toolbox in a systematic
way, but we will also attempt to show how these tools fit together to
yield a surprisingly cohesive set of probabilistic ideas. Of course, one should
not expect that any genuinely interesting problem that arises in one of the
various fascinating areas listed above can be resolved by an immediate application
of a tool in our toolbox; the solution of such problems typically requires
insights that are specific to each area. However, the common set of ideas that
we will develop provides key ingredients for the investigation of many high-dimensional
problems, and forms an essential basis for work in this area.


1.2  Some  general  principles 

The  toolbox  that  we  will  develop  is  equipped  to  address  a  number  of  different 
phenomena  that  arise  in  high  dimension.  To  give  a  broad  overview  of  some 
of  the  ideas  to  be  developed,  and  to  set  the  stage  for  coming  attractions,  we 
will  organize  our  theory  around  three  informal  “principles”  to  be  described 
presently.  None  of  these  principles  corresponds  to  one  particular  theorem  or 
admits  a  precise  mathematical  description;  rather,  each  principle  encompasses 
a family of conceptually related results that appear in different guises in different
settings. The bulk of this course is aimed at making these ideas precise.


1.2.1  Concentration 


If $X_1, X_2, \ldots$ are i.i.d. random variables, then

$$\frac{1}{n}\sum_{k=1}^n X_k - \mathbf{E}\Bigg[\frac{1}{n}\sum_{k=1}^n X_k\Bigg] \longrightarrow 0 \quad\text{as } n\to\infty$$

by the law of large numbers. Another way of stating this is as follows: if we
define the function $f(x_1,\ldots,x_n) = \frac{1}{n}\sum_{k=1}^n x_k$, then for large $n$ the random
variable $f(X_1,\ldots,X_n)$ is close to its mean (that is, its fluctuations are small).

It turns out that this phenomenon is not restricted to linear functions $f$:
it is a manifestation of a general principle, the concentration phenomenon, by
virtue of which it is very common for functions of many independent variables
to have small fluctuations. Let us informally state this principle as follows.

If $X_1,\ldots,X_n$ are independent (or weakly dependent) random variables,
then the random variable $f(X_1,\ldots,X_n)$ is “close” to its mean
$\mathbf{E}[f(X_1,\ldots,X_n)]$ provided that the function $f(x_1,\ldots,x_n)$ is not too
“sensitive” to any of the coordinates $x_i$.

Of  course,  to  make  such  a  statement  precise,  we  have  to  specify: 

•  What  do  we  mean  by  “sensitive”? 

•  What  do  we  mean  by  “close”? 

We will develop a collection of results, and some general methods to prove
such results in different settings, in which these concepts are given a precise
meaning. In each case, such a result takes the form of an explicit bound
on a quantity that measures the size of the fluctuations
$f(X_1,\ldots,X_n) - \mathbf{E}[f(X_1,\ldots,X_n)]$ (such as the variance or tail probabilities) in terms of
“dimension” $n$ and properties of the distribution of the random variables $X_i$.
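To get a feeling for the principle before any theory is developed, the following minimal sketch simulates a nonlinear function that is not very sensitive to any single coordinate: the Euclidean norm of a vector of independent standard Gaussian coordinates. The choice of function, the name `norm_stats`, the sample sizes and the seed are all illustrative assumptions, not part of the notes. One should expect the mean to grow roughly like $\sqrt{n}$ while the standard deviation stays of order one.

```python
import math
import random
import statistics

def norm_stats(n, trials=1000, seed=0):
    """Monte Carlo mean and standard deviation of the Euclidean norm
    of a vector of n independent standard Gaussian coordinates."""
    rng = random.Random(seed)
    samples = [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))
        for _ in range(trials)
    ]
    return statistics.fmean(samples), statistics.pstdev(samples)

# Illustrative dimensions; the fluctuations should not grow with n.
for n in (4, 40, 400):
    mean, std = norm_stats(n)
    print(f"n={n:4d}  mean={mean:7.3f}  std={std:6.3f}")
```

In this experiment the fluctuations are small relative to the mean, which is one (informal) reading of “close”; the precise notions are the subject of the chapters that follow.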

At  this  point,  it  is  perhaps  far  from  clear  why  a  principle  of  the  above 
type  might  be  expected  to  hold.  We  will  develop  a  number  of  general  tools 
to  prove  such  results  that  provide  insight  into  the  nature  of  concentration,  as 
well  as  its  connection  with  other  topics.  One  theme  that  will  arise  repeatedly 
in the sequel is the connection between concentration and the rate of convergence
to equilibrium of Markov processes. At first sight, these appear to be
entirely different questions: the concentration problem is concerned with the
fluctuations of $f(X)$ for a given (vector-valued) random variable $X$ and (possibly
very nonlinear) function $f$, with no Markov process in sight. Nonetheless,
it turns out that one can prove concentration properties by investigating
Markov processes that have the distribution of $X$ as their stationary distribution.
Conversely, functional inequalities closely connected to concentration can
be used to investigate the convergence of Markov processes to the stationary
distribution (which is of interest in its own right in many areas, for example,
in non-equilibrium statistical mechanics or Markov chain Monte Carlo
algorithms).  Once  this  connection  has  been  understood,  it  will  also  become 
clear  in  what  manner  such  results  can  be  systematically  improved.  This  will 
lead  us  to  the  notion  of  hypercontractivity  of  Markov  semigroups,  which  is 
in  turn  of  great  interest  in  various  other  probabilistic  problems.  Several  other 
connections  that  yield  significant  insight  into  the  concentration  phenomenon, 
including  to  isoperimetric  problems  and  problems  in  optimal  transportation 
and  information  theory,  will  be  developed  along  the  way. 
1.2.2 Suprema

The concentration principle is concerned with the deviation of a random function
$f(X_1,\ldots,X_n)$ from its mean $\mathbf{E}[f(X_1,\ldots,X_n)]$. However, it does not provide
any information on the value of $\mathbf{E}[f(X_1,\ldots,X_n)]$ itself. In fact, the two
problems of estimating the magnitude and the fluctuations of $f(X_1,\ldots,X_n)$
prove to be quite distinct, and must be treated by different methods.

A  remarkable  feature  of  the  concentration  principle  is  that  it  provides 
information on the fluctuations for very general functions $f$: even in cases
where  the  function  f  is  very  complicated  to  compute  (for  example,  when  it  is 
defined  in  terms  of  a  combinatorial  optimization  problem),  it  is  often  possible 
to  estimate  its  sensitivity  to  the  coordinates  by  elementary  methods.  When 
it  comes  to  estimating  the  magnitude  of  the  corresponding  random  variable, 
there  is  no  hope  to  develop  a  principle  that  holds  at  this  level  of  generality:  the 
functions $f$ that arise in the different areas described in the previous section
are  very  different  in  nature,  and  we  cannot  hope  to  develop  general  tools  to 
address  such  problems  without  assuming  some  additional  structure. 

A structure that proves to be of central importance in many high-dimensional
problems is that of random variables $F$ defined as the supremum

$$F = \sup_{t\in T} X_t$$

of a random process $\{X_t\}_{t\in T}$ (that is, a family of random variables indexed by
a set $T$ that is frequently high- or infinite-dimensional). The reason that such
quantities  play  an  important  role  in  high-dimensional  problems  is  twofold.  On 
the  one  hand,  problems  in  high  dimension  typically  involve  a  large  number  of 
interdependent  degrees  of  freedom;  the  need  to  obtain  simultaneous  control 
over  many  random  variables  thus  arises  frequently  as  an  ingredient  of  the 
mathematical  analysis.  On  the  other  hand,  there  are  many  problems  in  which 
various  quantities  of  interest  can  be  naturally  expressed  in  terms  of  suprema. 
Let us consider a few simple examples for the sake of illustration.

Example 1.1 (Random matrices). Let $M = (M_{ij})_{1\le i,j\le n}$ be a random matrix
whose entries $M_{ij}$ are independent (let us assume they are Gaussian for the sake
of illustration). One question of interest in this setting is to estimate the
magnitude of the matrix norm $\|M\|$ (the largest singular value of $M$), which
is a nontrivial function of the matrix entries. But recall from linear algebra that

$$\|M\| = \sup_{v,w\in B_2} \langle v, Mw\rangle,$$

where $B_2$ is the (Euclidean) unit ball and $\langle\cdot,\cdot\rangle$ denotes the usual inner product
in $\mathbb{R}^n$. We can therefore treat the matrix norm $\|M\|$ as the supremum of the
Gaussian process $\{X_{v,w} = \langle v, Mw\rangle\}_{v,w\in B_2}$ indexed by $B_2\times B_2$.
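The variational formula for $\|M\|$ can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it uses plain Python lists, approximates the largest singular value by power iteration on $M^{\mathsf T}M$ (a standard numerical technique, not a method from these notes), and compares it with a crude random search over pairs of unit vectors. All function names, the dimension and the seeds are hypothetical choices.

```python
import math
import random

def matvec(M, w):
    """Matrix-vector product M w for a matrix stored as a list of rows."""
    return [sum(m_ij * w_j for m_ij, w_j in zip(row, w)) for row in M]

def inner(v, w):
    """Usual inner product <v, w>."""
    return sum(a * b for a, b in zip(v, w))

def unit(v):
    """Rescale v to Euclidean length one."""
    length = math.sqrt(inner(v, v))
    return [a / length for a in v]

def top_singular_pair(M, iters=500, seed=0):
    """Power iteration on M^T M: returns (sigma, v, w) with unit vectors
    v, w such that <v, M w> = |M w| = sigma approximates ||M||."""
    rng = random.Random(seed)
    Mt = [list(col) for col in zip(*M)]
    w = unit([rng.gauss(0.0, 1.0) for _ in M[0]])
    for _ in range(iters):
        w = unit(matvec(Mt, matvec(M, w)))
    Mw = matvec(M, w)
    return math.sqrt(inner(Mw, Mw)), unit(Mw), w

def random_unit(rng, n):
    """A uniformly random direction on the unit sphere of R^n."""
    return unit([rng.gauss(0.0, 1.0) for _ in range(n)])

rng = random.Random(1)
n = 5
M = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
sigma, v_star, w_star = top_singular_pair(M)

# Random search explores only part of the index set, so it gives a lower bound.
random_best = max(
    inner(random_unit(rng, n), matvec(M, random_unit(rng, n)))
    for _ in range(5000)
)
print(sigma, inner(v_star, matvec(M, w_star)), random_best)
```

The random search typically falls well short of the supremum even in dimension five, which hints at why controlling a supremum over a large index set requires more structured arguments than sampling individual indices.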

Example 1.2 (Norms of random vectors). Let $X$ be a random vector in $\mathbb{R}^n$,
and let $\|\cdot\|_B$ be any norm on $\mathbb{R}^n$ (where $B$ denotes the unit ball of $\|\cdot\|_B$).
The duality theory of Banach spaces implies that we can write

$$\|X\|_B = \sup_{t\in B^\circ} \langle t, X\rangle,$$

where $B^\circ$ denotes the dual ball. In this manner, the supremum of the random
process $\{X_t = \langle t, X\rangle\}_{t\in B^\circ}$ arises naturally in probability in Banach spaces.

Example 1.3 (Empirical risk minimization). Many problems in statistics and
machine learning may be formulated as the problem of computing

$$\operatorname*{argmin}_{\theta\in\Theta} \mathbf{E}[l(\theta, X)]$$

given only observed “data” consisting of i.i.d. samples $X_1,\ldots,X_n \sim X$ (that
is, without knowledge of the law of $X$). Here $l$ is a given loss function and $\Theta$
is a given parameter space, which depend on the problem at hand.

Perhaps the simplest general way to address this problem is to reason as
follows. By the law of large numbers, we can approximate the risk for a fixed
parameter $\theta$ by the empirical risk, which depends only on the data:

$$\mathbf{E}[l(\theta, X)] \approx \frac{1}{n}\sum_{k=1}^n l(\theta, X_k).$$

One might therefore naturally expect that

$$\operatorname*{argmin}_{\theta\in\Theta} \mathbf{E}[l(\theta, X)] \approx \operatorname*{argmin}_{\theta\in\Theta} \frac{1}{n}\sum_{k=1}^n l(\theta, X_k).$$

This approach to estimating the optimal parameter $\theta$ from data is called
empirical risk minimization. The problem is now to estimate how close the
empirical risk minimizer is to the optimal parameter as a function of the
number of samples $n$, the dimension of the parameter space $\Theta$, the dimension
of the state space of $X$, et cetera. The resolution of this question leads naturally
to the investigation of quantities such as the uniform deviation

$$\sup_{\theta\in\Theta} \frac{1}{n}\sum_{k=1}^n \{l(\theta, X_k) - \mathbf{E}[l(\theta, X)]\},$$

which is the supremum of a random process. Estimating the magnitude of
suprema arises in a similar manner in a wide array of statistical problems.
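As a toy illustration of this example, the sketch below carries out empirical risk minimization in a setting where everything can be computed by hand. The specific choices are assumptions made purely for illustration: the squared loss $l(\theta,x) = (\theta-x)^2$, data $X \sim \mathrm{Uniform}[0,1]$ (so that the true risk is $(\theta-\tfrac12)^2 + \tfrac1{12}$ and the optimal parameter is $\tfrac12$), and a finite grid standing in for $\Theta = [-1,1]$.

```python
import random

def loss(theta, x):
    """Squared loss l(theta, x) = (theta - x)^2 (an illustrative choice)."""
    return (theta - x) ** 2

def true_risk(theta):
    """E[l(theta, X)] for X ~ Uniform[0, 1]: (theta - 1/2)^2 + 1/12."""
    return (theta - 0.5) ** 2 + 1.0 / 12.0

def empirical_risk(theta, data):
    """Average loss of theta over the observed samples."""
    return sum(loss(theta, x) for x in data) / len(data)

def erm_and_uniform_deviation(n, grid, seed=0):
    """Return the empirical risk minimizer over the grid, and the uniform
    deviation sup_theta {empirical risk - true risk} over the same grid."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    risks = {theta: empirical_risk(theta, data) for theta in grid}
    theta_hat = min(grid, key=risks.get)
    deviation = max(risks[theta] - true_risk(theta) for theta in grid)
    return theta_hat, deviation

# Finite stand-in for the parameter space Theta = [-1, 1].
grid = [k / 100 for k in range(-100, 101)]
for n in (20, 200, 2000):
    theta_hat, deviation = erm_and_uniform_deviation(n, grid)
    print(n, theta_hat, deviation)
```

In this one-dimensional toy problem the uniform deviation is easy to control; the interesting questions arise when $\Theta$ is high-dimensional, where the complexity of the parameter space enters the estimates.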

Example 1.4 (Convex functions). In principle, we can formulate the problem
of estimating $\mathbf{E}[f(X_1,\ldots,X_n)]$ as a supremum problem whenever $f$ is convex.
Indeed, by convex duality, we can express any convex function $f:\mathbb{R}^n\to\mathbb{R}$ as

$$f(x) = \sup_{y\in\mathbb{R}^n} \{\langle y, x\rangle - f^*(y)\},$$

where $f^*$ denotes the convex conjugate of $f$. The function $f(X_1,\ldots,X_n)$
can therefore be expressed as the supremum of the random process
$\{X_y = \langle y, X\rangle\}_{y\in\mathbb{R}^n}$ after subtracting the “penalty” $f^*(y)$ (alternatively, $f^*$ can be
absorbed in the definition of $X_y$). This shows that the investigation of suprema
is in fact surprisingly general; this general point of view is very useful in some
applications, while more direct methods might be more suitable in other cases.

In all these cases, the process $X_t$ itself admits a simple description, and the
difficulty lies in obtaining good estimates on the magnitude of the supremum
(for example, to estimate the mean or the tail probabilities). In this setting,
a second general principle appears that provides a key tool in many high-dimensional
problems. We informally state this principle as follows.

If the random process $\{X_t\}_{t\in T}$ is “sufficiently continuous,” then the
magnitude of the supremum $\sup_{t\in T} X_t$ is controlled (in the sense that
we have estimates from above, and in some cases also from below) by
the “complexity” of the index set $T$.

Of  course,  to  make  this  precise,  we  have  to  specify: 

•  What  do  we  mean  by  “sufficiently  continuous”? 

•  What  do  we  mean  by  “complexity”? 

These  concepts  will  be  given  a  precise  meaning  in  the  sequel.  In  particular,  let 
us  note  that  while  the  supremum  of  a  random  process  is  a  probabilistic  object, 
complexity  is  not:  we  will  in  fact  consider  different  geometric  (packing  and 
covering  numbers  and  trees)  and  combinatorial  (shattering  and  combinatorial 
dimension)  notions  of  complexity.  We  will  develop  a  collection  of  powerful 
tools,  such  as  chaining  and  slicing  methods,  that  make  the  connection  between 
these  probabilistic,  geometric,  and  combinatorial  notions  in  a  general  setting. 
A  number  of  other  useful  tools  will  be  developed  along  the  way,  such  as  basic 
methods  for  bounding  Gaussian  and  Rademacher  processes. 


1.2.3  Universality 

Let $X_1, X_2, \ldots$ be i.i.d. random variables with finite variance. As in our discussion
of concentration, let us recall once more the law of large numbers

$$\frac{1}{n}\sum_{k=1}^n \{X_k - \mathbf{E}X_k\} \longrightarrow 0 \quad\text{as } n\to\infty.$$

In this setting, however, we do not only know that the fluctuations are of
order $n^{-1/2}$ (as is captured by the concentration phenomenon), but we have
much more precise information as well: by the central limit theorem, we have
a precise description of the distribution of the fluctuations, as

$$\frac{1}{\sqrt{n}}\sum_{k=1}^n \{X_k - \mathbf{E}X_k\} \approx \text{Gaussian}$$

when $n$ is large. A different way of phrasing this property is that

$$\frac{1}{\sqrt{n}}\sum_{k=1}^n \{X_k - \mathbf{E}X_k\} \approx \frac{1}{\sqrt{n}}\sum_{k=1}^n \{G_k - \mathbf{E}G_k\},$$

where $G_k$ are independent Gaussian random variables with the same mean
and variance as $X_k$ (here $\approx$ denotes closeness of the distributions). Besides
the fact that this gives precise distributional information, what is remarkable
about such results is that they become insensitive to the distribution of the
original random variables $X_k$ as $n\to\infty$. The phenomenon that the detailed
features of the distribution of the individual components of a problem become
irrelevant in high dimension is often referred to as universality.

As  in  the  case  of  concentration,  it  turns  out  that  this  phenomenon  is  not 
restricted  to  linear  functions  of  independent  random  variables,  but  is  in  fact 
a  manifestation  of  a  more  general  principle.  We  state  it  informally  as  follows. 

If $X_1,\ldots,X_n$ are independent (or weakly dependent) random variables,
then the expectation $\mathbf{E}[f(X_1,\ldots,X_n)]$ is “insensitive” to the
distribution of $X_1,\ldots,X_n$ when the function $f$ is “sufficiently smooth.”

Of  course,  to  make  this  precise,  we  have  to  specify: 

•  What  do  we  mean  by  “insensitive”? 

•  What  do  we  mean  by  “sufficiently  smooth”? 

We  will  develop  some  basic  quantitative  methods  to  prove  universality  in 
which  these  concepts  are  given  a  precise  meaning. 
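A quick numerical experiment can make this insensitivity plausible. The sketch below is an illustration only: the smooth test function $f(x_1,\ldots,x_n) = \cos(n^{-1/2}\sum_k x_k)$, the three input distributions (all centered with unit variance), the value $n = 50$ and the helper name `expected_cos` are assumptions chosen for convenience. For a standard Gaussian $G$ one has $\mathbf{E}[\cos G] = e^{-1/2}$, which serves as the reference value.

```python
import math
import random

def expected_cos(sampler, n, trials=4000, seed=0):
    """Monte Carlo estimate of E[cos(S_n)], S_n = n^{-1/2} (X_1 + ... + X_n),
    where sampler(rng) draws one centered, unit-variance X_k."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = sum(sampler(rng) for _ in range(n)) / math.sqrt(n)
        total += math.cos(s)
    return total / trials

# Three different coordinate distributions with mean 0 and variance 1.
samplers = {
    "rademacher": lambda rng: rng.choice((-1.0, 1.0)),
    "uniform": lambda rng: rng.uniform(-math.sqrt(3.0), math.sqrt(3.0)),
    "gaussian": lambda rng: rng.gauss(0.0, 1.0),
}

gaussian_limit = math.exp(-0.5)  # E[cos(G)] for a standard Gaussian G
for name, sampler in samplers.items():
    print(name, expected_cos(sampler, n=50), gaussian_limit)
```

Up to Monte Carlo error, the three estimates should be close to one another and to the Gaussian value, even though the coordinate distributions are quite different.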

The  interest  of  the  universality  phenomenon  is  twofold.  First,  the  presence 
of  the  universality  property  suggests  that  the  high-dimensional  phenomenon 
under  investigation  is  in  a  sense  robust  to  the  precise  details  of  the  model 
ingredients,  a  conclusion  of  significant  interest  in  its  own  right  (of  course,  there 
are also many high-dimensional phenomena that are not universal!). Second,
there  are  often  situations  in  which  the  quantities  of  interest  can  be  evaluated 
by  explicit  computation  when  the  underlying  random  variables  have  a  special 
distribution,  but  where  such  explicit  analysis  would  be  impossible  in  a  general 
setting.  For  example,  in  random  matrix  theory,  many  explicit  computations 
are  possible  for  appropriately  defined  Gaussian  random  matrices  due  to  the 
invariance  of  the  distribution  under  orthogonal  transformations,  while  such 
computations  would  be  completely  intractable  for  other  distributions  of  the 
entries.  In  such  cases,  universality  properties  provide  a  crucial  tool  to  reduce 
the  proofs  of  general  results  to  those  in  a  tractable  special  case. 

Let  us  note  that  the  universality  phenomenon  is  not  necessarily  related 
to the Gaussian distribution: universality simply states that certain probabilistic
quantities do not depend strongly on the distribution of the individual
components. However, Gaussian distributions do appear frequently in many
high-dimensional problems that involve the aggregate effect of many independent
degrees of freedom, as do several other distributions (such as Poisson
distributions in discrete problems and extreme value distributions for maxima
of independent random variables; a much less well understood phenomenon
is the appearance of the Tracy-Widom distribution in many complex systems
that are said to belong to the “KPZ universality class,” a topic of intense recent
activity in probability theory). Thus the related but more precise question
of when the distribution of a random variable $F$ is close to Gaussian or to some
other  distribution  also  arises  naturally  in  this  setting.  Explicit  nonasymptotic 
estimates  in  terms  of  dimensional  parameters  of  the  problem  can  be  obtained 
using  a  set  of  tools  (collectively  known  as  Stein’s  method)  that  have  proved 
to  be  very  useful  in  a  number  of  high-dimensional  problems. 


1.3  Organization  of  this  course 

We  have  introduced  above  three  “principles”  to  motivate  some  of  the  general 
probabilistic  ideas  that  arise  in  high-dimensional  problems.  These  principles 
should not be taken too seriously, but rather as an informal guide to place into
perspective  the  topics  that  we  will  cover  in  the  sequel.  In  the  following  lectures, 
we  will  proceed  to  develop  these  ideas  in  a  precise  manner,  and  to  exhibit  the 
many  interconnections  between  these  topics. 

Unfortunately,  there  is  a  lot  of  ground  to  cover,  probably  way  too  much 
for  a  single  semester.  Thus  some  hard  choices  will  likely  have  to  be  made, 
depending  on  the  interests  of  the  audience.  Let  us  start  out  ambitiously,  and 
see  how  things  develop  as  the  course  progresses. 


Part  I 


Concentration 


2 


Variance bounds and Poincaré inequalities


Recall  the  informal  statement  of  the  concentration  phenomenon  from  Ch.  1: 

If $X_1,\ldots,X_n$ are independent (or weakly dependent) random variables,
then the random variable $f(X_1,\ldots,X_n)$ is “close” to its mean
$\mathbf{E}[f(X_1,\ldots,X_n)]$ provided that the function $f(x_1,\ldots,x_n)$ is not too
“sensitive” to any of the coordinates $x_i$.

In  this  chapter,  we  will  make  a  modest  start  towards  making  this  principle 
precise  by  investigating  bounds  on  the  variance 

$$\mathrm{Var}[f(X_1,\ldots,X_n)] := \mathbf{E}[(f(X_1,\ldots,X_n) - \mathbf{E}f(X_1,\ldots,X_n))^2]$$

in terms of the “sensitivity” of the function $f$ to its coordinates. Various
fundamental ideas and a rich theory already arise in this setting, and this is
therefore our natural starting point. In the following chapters we will show
how to go beyond the variance to obtain bounds on the distribution of the
fluctuations of $f(X_1,\ldots,X_n)$ that are useful in many settings.

2.1  Tensorization  and  bounded  differences 

At  first  sight,  it  might  seem  that  the  concentration  principle  is  rather  trivial 
when  stated  in  terms  of  variance.  Indeed,  the  variance  of  a  constant  function 
is  zero,  and  it  is  easy  to  show  that  the  variance  of  a  function  that  is  almost 
constant  is  almost  zero.  For  example,  we  have  the  following  simple  lemma: 

Lemma 2.1. Let $X$ be any (possibly vector-valued) random variable. Then

$$\mathrm{Var}[f(X)] \le \tfrac{1}{4}(\sup f - \inf f)^2 \quad\text{and}\quad \mathrm{Var}[f(X)] \le \mathbf{E}[(f(X) - \inf f)^2].$$

Proof. Note that

$$\mathrm{Var}[f(X)] = \mathrm{Var}[f(X) - a] \le \mathbf{E}[(f(X) - a)^2] \quad\text{for any } a\in\mathbb{R}.$$

For the first inequality, let $a = (\sup f + \inf f)/2$ and note that $|f(X) - a| \le
(\sup f - \inf f)/2$. For the second inequality, let $a = \inf f$. □
The problem with this trivial result is that it does not capture at all the
high-dimensional phenomenon that we set out to investigate. For example, it
gives a terrible bound for the law of large numbers.

Example 2.2. Let $X_1,\ldots,X_n$ be independent random variables with values in
$[-1,1]$, and let $f(x_1,\ldots,x_n) = \frac{1}{n}\sum_{k=1}^n x_k$. Then a direct computation gives

$$\mathrm{Var}[f(X_1,\ldots,X_n)] = \frac{1}{n^2}\sum_{k=1}^n \mathrm{Var}[X_k] \le \frac{1}{n}.$$

That is, the average of i.i.d. random variables concentrates increasingly well
around its mean as the dimension is increased. On the other hand, both bounds
of Lemma 2.1 give $\mathrm{Var}[f(X_1,\ldots,X_n)] \le 1$: for example,

$$\mathrm{Var}[f(X_1,\ldots,X_n)] \le \tfrac{1}{4}(\sup f - \inf f)^2 = 1.$$

Thus Lemma 2.1 provides a reasonable bound on the variance in one dimension,
but is grossly inadequate in high dimension.

Of course, this should not be surprising: no independence was assumed in
Lemma 2.1, and so there is no reason why we should obtain a sharper concentration
phenomenon at this level of generality. For example, if $X_1,\ldots,X_n$
are random variables that are totally dependent, $X_1 = X_2 = \cdots = X_n$, then
the variance of $\frac{1}{n}\sum_{k=1}^n X_k$ is indeed of order one regardless of the “dimension”
$n$, and Lemma 2.1 captures this situation accurately. The idea that concentration
should improve in high dimension arises when there are many independent
degrees of freedom. To capture this high-dimensional phenomenon, we must
develop a method to exploit independence in our inequalities.
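The contrast between the independent and the totally dependent case is easy to observe in simulation. In the sketch below, the coordinates are taken to be uniform on $[-1,1]$ (an assumption made for illustration, so that $\mathrm{Var}[X_k] = \tfrac13$), and the helper name `variance_of_average`, the trial count and the seed are hypothetical choices.

```python
import random
import statistics

def variance_of_average(n, dependent, trials=2000, seed=0):
    """Monte Carlo variance of (1/n) * sum_k X_k with X_k ~ Uniform[-1, 1].
    dependent=False: independent coordinates; dependent=True: X_1 = ... = X_n."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        if dependent:
            x = rng.uniform(-1.0, 1.0)
            xs = [x] * n
        else:
            xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        averages.append(sum(xs) / n)
    return statistics.pvariance(averages)

# Bound of Lemma 2.1 for f with values in [-1, 1]: (1/4)(sup f - inf f)^2 = 1.
lemma_bound = 0.25 * (1.0 - (-1.0)) ** 2
for n in (1, 10, 100):
    print(n, variance_of_average(n, False), variance_of_average(n, True), lemma_bound)
```

In the independent case the estimated variance should decay like $1/n$, while in the totally dependent case it should remain near $\tfrac13$ for every $n$; only the latter behavior is of the order predicted by Lemma 2.1.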

To this end, we presently introduce an idea that appears frequently in high-dimensional
problems: we will deduce a bound for functions of independent
random variables $X_1,\ldots,X_n$ (i.e., in high dimension) from bounds for functions
of each individual random variable $X_i$ (i.e., in a single dimension). It is
not at all obvious that this is possible: in general, one cannot expect to deduce
high-dimensional inequalities from low-dimensional ones without introducing
additional dimension-dependent factors. Those quantities for which this is in
fact possible are said to tensorize.¹ Quantities that tensorize behave well in
high dimension, and are therefore particularly important in high-dimensional
problems. We will presently prove that the variance is such a quantity. With
the tensorization inequality for the variance in hand, we will have reduced the
proof of concentration inequalities for functions of many independent random
variables to obtaining such bounds for a single random variable.


¹ The joint law $\mu_1\otimes\cdots\otimes\mu_n$ of independent random variables $X_1,\ldots,X_n$ is the
tensor product of the marginal laws $X_i\sim\mu_i$; the terminology “tensorization”
indicates that a quantity is well behaved under the formation of tensor products.


2.1  Tensorization  and  bounded  differences 
To formulate the tensorization inequality, let $X_1,\ldots,X_n$ be independent
random variables. For each function $f(x_1,\ldots,x_n)$, we define the function

$$\mathrm{Var}_i f(x_1,\ldots,x_n) := \mathrm{Var}[f(x_1,\ldots,x_{i-1},X_i,x_{i+1},\ldots,x_n)].$$

That is, $\mathrm{Var}_i f(x)$ is the variance of $f(X_1,\ldots,X_n)$ with
respect to the variable $X_i$ only, the remaining variables being kept fixed.

Theorem 2.3 (Tensorization of variance). We have

$$\mathrm{Var}[f(X_1,\ldots,X_n)] \le \mathbf{E}\Bigg[\sum_{i=1}^n \mathrm{Var}_i f(X_1,\ldots,X_n)\Bigg]$$

whenever $X_1,\ldots,X_n$ are independent.

Note that when $f$ is a linear function, it is readily checked that the
inequality of Theorem 2.3 holds with equality: in this sense, the result is sharp.
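
To make the equality case concrete, here is the short computation for a linear function. The coefficients $a_i$ are notation introduced only for this illustration, and each $X_i$ is assumed to have finite variance; the final equality uses independence.

```latex
% Illustration only: f is assumed linear with coefficients a_1,...,a_n.
f(x) = \sum_{i=1}^n a_i x_i
\quad\Longrightarrow\quad
\mathrm{Var}_i f(x) = \mathrm{Var}[a_i X_i] = a_i^2\,\mathrm{Var}[X_i],
\qquad
\mathbf{E}\Bigg[\sum_{i=1}^n \mathrm{Var}_i f(X_1,\ldots,X_n)\Bigg]
  = \sum_{i=1}^n a_i^2\,\mathrm{Var}[X_i]
  = \mathrm{Var}[f(X_1,\ldots,X_n)].
```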

The proof of Theorem 2.3 is a first example of the martingale method,
which will prove useful for obtaining more general inequalities later on.

Proof. Define

$$\Delta_k = \mathbf{E}[f(X_1,\ldots,X_n)|X_1,\ldots,X_k] - \mathbf{E}[f(X_1,\ldots,X_n)|X_1,\ldots,X_{k-1}].$$

Then

$$f(X_1,\ldots,X_n) - \mathbf{E}f(X_1,\ldots,X_n) = \sum_{k=1}^n \Delta_k,$$

and $\mathbf{E}[\Delta_k|X_1,\ldots,X_{k-1}] = 0$, that is, $\Delta_1,\ldots,\Delta_n$
are martingale increments. In particular, as
$\mathbf{E}[\Delta_k\Delta_l] = \mathbf{E}[\mathbf{E}[\Delta_k|X_1,\ldots,X_{k-1}]\Delta_l] = 0$
for $l < k$, we have

$$\mathrm{Var}[f(X_1,\ldots,X_n)] = \mathbf{E}\Bigg[\bigg(\sum_{k=1}^n \Delta_k\bigg)^2\Bigg] = \sum_{k=1}^n \mathbf{E}[\Delta_k^2].$$

It remains to show that $\mathbf{E}[\Delta_k^2] \le \mathbf{E}[\mathrm{Var}_k f(X_1,\ldots,X_n)]$ for every $k$.

To this end, note that

$$\begin{aligned}
&\mathbf{E}[f(X_1,\ldots,X_n)|X_1,\ldots,X_{k-1}] \\
&\quad= \mathbf{E}[\mathbf{E}[f(X_1,\ldots,X_n)|X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n]\,|\,X_1,\ldots,X_{k-1}] \\
&\quad= \mathbf{E}[\mathbf{E}[f(X_1,\ldots,X_n)|X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n]\,|\,X_1,\ldots,X_k],
\end{aligned}$$

where we have used the tower property of the conditional expectation in the
first equality, and that $X_k$ is independent of $X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n$ in
the second equality. In particular, we can write
$\Delta_k = \mathbf{E}[\tilde\Delta_k|X_1,\ldots,X_k]$ with

$$\tilde\Delta_k = f(X_1,\ldots,X_n) - \mathbf{E}[f(X_1,\ldots,X_n)|X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n].$$

But as $X_k$ and $X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n$ are independent, we have

$$\mathrm{Var}_k f(X_1,\ldots,X_n) = \mathbf{E}[\tilde\Delta_k^2|X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n].$$

We can therefore estimate using Jensen's inequality

$$\mathbf{E}[\Delta_k^2] = \mathbf{E}[\mathbf{E}[\tilde\Delta_k|X_1,\ldots,X_k]^2] \le \mathbf{E}[\tilde\Delta_k^2] = \mathbf{E}[\mathrm{Var}_k f(X_1,\ldots,X_n)],$$

which completes the proof. □

One can view tensorization of the variance in itself as an expression of the
concentration phenomenon: $\mathrm{Var}_i f(x)$ quantifies the sensitivity of
the function $f(x)$ to the coordinate $x_i$ in a distribution-dependent manner.
Thus Theorem 2.3 already expresses the idea that if the sensitivity of $f$ to
each coordinate is small, then $f(X_1,\ldots,X_n)$ is close to its mean. Unlike
Lemma 2.1, however, Theorem 2.3 holds with equality for linear functions and
thus captures precisely the behavior of the variance in the law of large
numbers. The tensorization inequality generalizes this idea to arbitrary
nonlinear functions, and constitutes our first nontrivial concentration result.

However, it may not be straightforward to compute $\mathrm{Var}_i f$: this
quantity depends not only on the function $f$, but also on the distribution of
$X_i$. In many cases, Theorem 2.3 is most useful in combination with a suitable
bound on the variances $\mathrm{Var}_i f$ in each dimension. Even the trivial
bounds of Lemma 2.1 already suffice to obtain a variance bound that is
extremely useful in many cases. To this end, let us define the quantities

$$D_i f(x) := \sup_z f(x_1,\ldots,x_{i-1},z,x_{i+1},\ldots,x_n) - \inf_z f(x_1,\ldots,x_{i-1},z,x_{i+1},\ldots,x_n)$$

and

$$D_i^- f(x) := f(x_1,\ldots,x_n) - \inf_z f(x_1,\ldots,x_{i-1},z,x_{i+1},\ldots,x_n).$$

Then $D_i f(x)$ and $D_i^- f(x)$ quantify the sensitivity of the function $f(x)$
to the coordinate $x_i$ in a distribution-independent manner. The following
bounds now follow immediately from Theorem 2.3 and Lemma 2.1.

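Corollary 2.4 (Bounded difference inequalities). We have

$$\mathrm{Var}[f(X_1,\ldots,X_n)] \le \frac{1}{4}\,\mathbf{E}\Bigg[\sum_{i=1}^n (D_i f(X_1,\ldots,X_n))^2\Bigg]$$

and

$$\mathrm{Var}[f(X_1,\ldots,X_n)] \le \mathbf{E}\Bigg[\sum_{i=1}^n (D_i^- f(X_1,\ldots,X_n))^2\Bigg]$$

whenever $X_1,\ldots,X_n$ are independent.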
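
As a sanity check, the chain of inequalities behind Theorem 2.3 and Corollary 2.4 can be verified by exact enumeration on a small discrete example. The following is only a sketch under stated assumptions: the variables are taken i.i.d. on a finite set, the supremum and infimum in $D_i f$ and $D_i^- f$ are taken over that finite support, and the names `variance_bounds` and `max_prefix_sum` are hypothetical helpers introduced for illustration.

```python
from itertools import product
from math import prod


def variance_bounds(f, n, values, probs):
    """Exact computation for X_1..X_n i.i.d. with P[X = values[j]] = probs[j].

    Returns (Var f, E sum_i Var_i f, (1/4) E sum_i (D_i f)^2, E sum_i (D_i^- f)^2),
    where sup/inf are taken over the finite support `values` (an assumption).
    """
    support = list(zip(values, probs))
    mean = second = tensor = bounded = one_sided = 0.0
    for point in product(support, repeat=n):
        x = tuple(v for v, _ in point)
        w = prod(q for _, q in point)  # probability of the outcome x
        fx = f(x)
        mean += w * fx
        second += w * fx * fx
        for i in range(n):
            # Values of f as coordinate i ranges over its support, others fixed.
            vals = [f(x[:i] + (z,) + x[i + 1:]) for z in values]
            m_i = sum(q * v for q, v in zip(probs, vals))
            tensor += w * sum(q * (v - m_i) ** 2 for q, v in zip(probs, vals))
            bounded += w * (max(vals) - min(vals)) ** 2 / 4
            one_sided += w * (fx - min(vals)) ** 2
    return second - mean ** 2, tensor, bounded, one_sided


def max_prefix_sum(x):
    """A nonlinear test function: the largest partial sum (empty sum included)."""
    best = total = 0
    for v in x:
        total += v
        best = max(best, total)
    return best


var, tensor, bounded, one_sided = variance_bounds(
    max_prefix_sum, n=5, values=(-1, 1), probs=(0.8, 0.2))
print(var, tensor, bounded, one_sided)
```

With a biased distribution such as the one above, the tensorized quantity is typically strictly smaller than the bounded difference bound, which reflects the fact that $D_i f$ ignores the distribution of $X_i$.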


Let us illustrate the utility of these inequalities in a nontrivial example.

Example 2.5 (Random matrices). Let $M$ be an $n\times n$ symmetric matrix where
$\{M_{ij} : i \ge j\}$ are i.i.d. symmetric Bernoulli random variables
$\mathbf{P}[M_{ij} = \pm 1] = \frac{1}{2}$. We are interested in
$\lambda_{\max}(M)$, the largest eigenvalue of $M$. This is a highly nonlinear
function of the entries: it is not immediately obvious what is the order of
magnitude of either the mean or the variance of $\lambda_{\max}(M)$.

Recall from linear algebra that

$$\lambda_{\max}(M) = \sup_{v\in B_2}\langle v, Mv\rangle = \langle v_{\max}(M), M v_{\max}(M)\rangle,$$

where $B_2 = \{v\in\mathbb{R}^n : \|v\|_2 \le 1\}$ is the Euclidean unit ball in
$\mathbb{R}^n$ and $v_{\max}(M)$ is any unit eigenvector of $M$ with eigenvalue
$\lambda_{\max}(M)$. Since $\lambda_{\max}(M)$ is the supremum of a random
process, we will be able to use tools from the second part of this course to
estimate its mean: it will turn out that
$\mathbf{E}[\lambda_{\max}(M)] \sim \sqrt{n}$. Let us now use Corollary 2.4 to
estimate the variance.

Let us consider for the time being a fixed matrix $M$ and indices $i \ge j$.
Choose a symmetric matrix $M^-$ such that

$$\lambda_{\max}(M^-) = \inf_{M_{ij}} \lambda_{\max}(M),$$

that is, $M^-_{ij} = M^-_{ji}$ is chosen to minimize $\lambda_{\max}(M^-)$ while
the remaining entries $M^-_{kl} = M_{kl}$ with $\{k,l\} \ne \{i,j\}$ are kept
fixed. Then we can estimate

$$\begin{aligned}
D^-_{ij}\lambda_{\max}(M) &= \lambda_{\max}(M) - \lambda_{\max}(M^-) \\
&= \langle v_{\max}(M), M v_{\max}(M)\rangle - \sup_{v\in B_2}\langle v, M^- v\rangle \\
&\le \langle v_{\max}(M), (M - M^-)v_{\max}(M)\rangle \\
&= 2\,v_{\max}(M)_i\,v_{\max}(M)_j\,(M_{ij} - M^-_{ij}) \\
&\le 4\,|v_{\max}(M)_i|\,|v_{\max}(M)_j|,
\end{aligned}$$

where the penultimate line holds as $M_{kl} = M^-_{kl}$ unless $k = i, l = j$ or
$k = j, l = i$, and the last line holds as $M_{ij}, M^-_{ij}$ only take the
values $\pm 1$. As this inequality holds for every matrix $M$ and indices
$i,j$, Corollary 2.4 yields


$$\mathrm{Var}[\lambda_{\max}(M)] \le \mathbf{E}\Bigg[\sum_{i\ge j} 16\,|v_{\max}(M)_i|^2\,|v_{\max}(M)_j|^2\Bigg] \le 16,$$

where we have used that $\sum_{i=1}^n v_{\max}(M)_i^2 = 1$. Thus the variance of
the maximal eigenvalue of an $n\times n$ symmetric random matrix with Bernoulli
entries is bounded uniformly in the dimension $n$ (in contrast to the mean
$\sim\sqrt{n}$).
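
The final inequality in this example uses only that $v_{\max}(M)$ has unit norm. Spelled out, with $v = v_{\max}(M)$ as shorthand introduced here, the bound holds pointwise for every matrix $M$ and therefore also in expectation:

```latex
% Shorthand (not in the original text): v = v_max(M), a unit vector.
\sum_{i\ge j} 16\,|v_i|^2\,|v_j|^2
  \;\le\; 16\sum_{i,j=1}^n v_i^2\,v_j^2
  \;=\; 16\Bigg(\sum_{i=1}^n v_i^2\Bigg)^{2}
  \;=\; 16 .
```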


Remark 2.6. It is natural to ask whether the result of Example 2.5 is sharp: is
$\mathrm{Var}[\lambda_{\max}(M)]$ in fact of constant order as $n\to\infty$? It
turns out that this is not the case: using highly nontrivial specialized tools
from random matrix theory, it can be shown that in fact
$\mathrm{Var}[\lambda_{\max}(M)] \sim n^{-1/3}$, that is, the fluctuations of
the maximal eigenvalue in high dimension are even smaller than is predicted
by Corollary 2.4. In this example, the suboptimal bound already arises at the
level of the tensorization inequality: none of the methods developed here can
beat the dimension-free rate obtained in Example 2.5.

Thus this example highlights the fact that one cannot always expect to
obtain an optimal bound by the application of a general theorem. However,
this in no way diminishes the utility of these inequalities, whose aim is to
provide general principles for obtaining concentration properties in high
dimension. Indeed, even in the present example, we already obtained a genuinely
nontrivial result (a dimension-free bound on the variance) using a remarkably
simple analysis that did not use any special structure of random matrix
problems. In many applications such dimension-free bounds suffice, or provide
essential ingredients for a more delicate problem-specific analysis. It should
also be noted that there are many problems in which results such as Corollary
2.4 do give bounds of the optimal order (for example, for linear functions $f$).
Whether there exist general principles that can capture the improved order
of the fluctuations in settings such as Example 2.5 (the superconcentration
problem) remains a largely open question. This is an active research area.

The bounded difference inequalities of Corollary 2.4, and the tensorization
inequality of Theorem 2.3, are very useful in many settings. On the other hand,
these inequalities can often be restrictive due to various drawbacks:

•  Due to the supremum and infimum in the definition of $D_i f$ or $D_i^- f$,
bounds using bounded difference inequalities are typically restricted to
situations where the random variables $X_i$ and/or the function $f$ are
bounded. For example, the computation in Example 2.5 is useless for random
matrices with Gaussian entries. On the other hand, the tensorization inequality
itself does not require boundedness, but in nontrivial problems such as
Example 2.5 it is typically far from clear how to bound $\mathrm{Var}_i f$.

•  Bounded difference inequalities do not capture any information on the
distribution of $X_i$. For example, suppose $X_1,\ldots,X_n$ are i.i.d., and
consider $f(x) = \frac{1}{n}\sum_{k=1}^n x_k$. Then
$\mathrm{Var}[f(X_1,\ldots,X_n)] = \frac{1}{n}\mathrm{Var}[X_1]$, but the
bounded difference inequality only gives
$\mathrm{Var}[f(X_1,\ldots,X_n)] \le \frac{1}{n}\|X_1\|_\infty^2$.
The latter will be very pessimistic when $\mathrm{Var}[X_1] \ll \|X_1\|_\infty^2$.
On the other hand, the tensorization inequality is too distribution-dependent
in that it is often unclear how to bound $\mathrm{Var}_i f$ directly for a
given distribution.

•  The tensorization method depends fundamentally on the independence of
$X_1,\ldots,X_n$: it is not clear how this method can be extended beyond
independence to treat more general classes of high-dimensional distributions.

To address these issues, we must develop a more general framework for
understanding and proving variance inequalities.
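
The gap described in the second drawback can be made explicit by a direct computation for the empirical mean. This is a sketch assuming i.i.d. bounded $X_k$, with $\sup X_1$ and $\inf X_1$ denoting the essential supremum and infimum (notation introduced here):

```latex
\begin{aligned}
f(x) &= \frac{1}{n}\sum_{k=1}^n x_k, \\
\mathrm{Var}_i f(x) &= \frac{1}{n^2}\,\mathrm{Var}[X_1],
\qquad
D_i f(x) = \frac{\sup X_1 - \inf X_1}{n} \le \frac{2\|X_1\|_\infty}{n}, \\
\mathbf{E}\Bigg[\sum_{i=1}^n \mathrm{Var}_i f\Bigg] &= \frac{1}{n}\,\mathrm{Var}[X_1]
\qquad\text{versus}\qquad
\frac{1}{4}\,\mathbf{E}\Bigg[\sum_{i=1}^n (D_i f)^2\Bigg] \le \frac{1}{n}\,\|X_1\|_\infty^2 .
\end{aligned}
```

For instance, if $X_1$ is Bernoulli with $\mathbf{P}[X_1 = 1] = p$, then $\mathrm{Var}[X_1] = p(1-p)$ while $\|X_1\|_\infty = 1$, so the two sides differ by a factor of order $1/p$ when $p$ is small.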


Let us note that the inequalities obtained in this section can be viewed as
special cases of a general family of inequalities that are informally described
as follows. We can interpret $D_i f$ as a type of "discrete derivative" of the
function $f(x)$ with respect to the variable $x_i$. Similarly, $D_i^- f$ can be
viewed as a one-sided version of the discrete derivative. More vaguely, one
could also view $\mathrm{Var}_i f$ as a type of squared discrete derivative.
Thus the inequalities of this section are, roughly speaking, of the following
form:

$$\text{“ variance}(f) \lesssim \mathbf{E}[\,\|\text{gradient}(f)\|^2\,]\text{. ”}$$

Inequalities of this type are called Poincaré inequalities (after H. Poincaré,
who first published such an inequality for the uniform distribution on a
bounded domain in $\mathbb{R}^n$ and for the classical notion of gradient, ca.
1890). It turns out that the validity of a Poincaré inequality for a given
distribution is intimately connected to the convergence rate of a Markov
process that admits that distribution as a stationary measure. This fundamental
connection between two probabilistic problems provides a powerful framework to
understand and prove a broad range of Poincaré inequalities for different
distributions and with various different notions of "gradient" (and,
conversely, a powerful method to bound the convergence rate of Markov processes
in high dimension, an important problem in its own right with applications in
areas ranging from statistical mechanics to Markov Chain Monte Carlo algorithms
in computer science and in computational statistics). We therefore set out in
the sequel to develop this connection in some detail. Before we can do that,
however, we must first recall some basic elements of the theory of Markov
processes.


Problems 


2.1 (Banach-valued sums). Let $X_1,\ldots,X_n$ be independent random variables
with values in a Banach space $(B, \|\cdot\|_B)$. Suppose that these random
variables are bounded in the sense that $\|X_i\|_B \le C$ a.s. for every $i$.
Show that

$$\mathrm{Var}\Bigg[\bigg\|\frac{1}{n}\sum_{k=1}^n X_k\bigg\|_B\Bigg] \le \frac{C^2}{n}.$$

This is a simple vector-valued variant of the elementary fact that the variance
of $\frac{1}{n}\sum_{k=1}^n X_k$ for real-valued random variables $X_k$ is of
order $\frac{1}{n}$.

2.2 (Rademacher processes). Let $\varepsilon_1,\ldots,\varepsilon_n$ be
independent symmetric Bernoulli random variables
$\mathbf{P}[\varepsilon_i = \pm 1] = \frac{1}{2}$ (also called Rademacher
variables), and let $T \subseteq \mathbb{R}^n$. The following identity is
completely trivial:

$$\sup_{t\in T}\mathrm{Var}\Bigg[\sum_{k=1}^n \varepsilon_k t_k\Bigg] = \sup_{t\in T}\sum_{k=1}^n t_k^2.$$

Prove the following nontrivial fact:

$$\mathrm{Var}\Bigg[\sup_{t\in T}\sum_{k=1}^n \varepsilon_k t_k\Bigg] \le 4\sup_{t\in T}\sum_{k=1}^n t_k^2.$$

Thus taking the supremum inside the variance costs at most a constant factor.

2.3 (Bin packing). This is a classical application of bounded difference
inequalities. Let $X_1,\ldots,X_n$ be i.i.d. random variables with values in
$[0,1]$. Each $X_i$ represents the size of a package to be shipped. The
shipping containers are bins of size 1 (so each bin can hold a set of packages
whose sizes sum to at most 1). Let $B_n = f(X_1,\ldots,X_n)$ be the minimal
number of bins needed to store the packages. Note that computing $B_n$ is a
hard combinatorial optimization problem, but we can bound its mean and variance
by easy arguments.

a.  Show that $\mathrm{Var}[B_n] \le n/4$.

b.  Show that $\mathbf{E}[B_n] \ge n\,\mathbf{E}[X_1]$.

Thus the fluctuations $\lesssim \sqrt{n}$ of $B_n$ are much smaller than its
magnitude $\sim n$.

2.4 (Order statistics and spacings). Let $X_1,\ldots,X_n$ be independent
random variables, and denote by $X_{(1)} \ge \cdots \ge X_{(n)}$ their
decreasing rearrangement (so $X_{(1)} = \max_i X_i$, $X_{(n)} = \min_i X_i$,
etc.) Show that

$$\mathrm{Var}[X_{(k)}] \le k\,\mathbf{E}[(X_{(k)} - X_{(k+1)})^2] \quad\text{for } 1 \le k \le n/2,$$

and that

$$\mathrm{Var}[X_{(k)}] \le (n-k+1)\,\mathbf{E}[(X_{(k-1)} - X_{(k)})^2] \quad\text{for } n/2 < k \le n.$$

Hint: use Corollary 2.4 creatively.

2.5 (Convex Poincaré inequality). Let $X_1,\ldots,X_n$ be independent random
variables taking values in $[a,b]$. The bounded difference inequalities of
Corollary 2.4 estimate the variance $\mathrm{Var}[f(X_1,\ldots,X_n)]$ in terms
of discrete derivatives $D_i f$ or $D_i^- f$ of the function $f$. The goal of
this problem is to show that if the function $f$ is convex, then one can obtain
a similar bound in terms of the ordinary notion of derivative
$\nabla_i f(x) = \partial f(x)/\partial x_i$ in $\mathbb{R}^n$.

a.  Show that if $g : \mathbb{R}\to\mathbb{R}$ is convex, then

$$g(y) - g(x) \ge g'(x)(y - x) \quad\text{for all } x, y \in \mathbb{R}.$$

b.  Show using part a. and Corollary 2.4 that if $f : \mathbb{R}^n\to\mathbb{R}$ is convex, then

$$\mathrm{Var}[f(X_1,\ldots,X_n)] \le (b-a)^2\,\mathbf{E}[\|\nabla f(X_1,\ldots,X_n)\|^2].$$

c.  Conclude that if $f$ is convex and $L$-Lipschitz, i.e.,
$|f(x) - f(y)| \le L\|x - y\|$ for all $x, y \in [a,b]^n$, then
$\mathrm{Var}[f(X_1,\ldots,X_n)] \le L^2(b-a)^2$.


2.2  Markov  semigroups

A (homogeneous) Markov process $(X_t)_{t\in\mathbb{R}_+}$ is a random process
that satisfies the Markov property: for every bounded measurable function $f$
and $s,t\in\mathbb{R}_+$, there is a bounded measurable function $P_s f$ such
that

$$\mathbf{E}[f(X_{t+s})|\{X_r\}_{r\le t}] = (P_s f)(X_t).$$

[We do not put any restrictions on the state space: $X_t$ can take values in any
measurable space $E$, and the functions above are of the form
$f : E\to\mathbb{R}$.] The interpretation, of course, is classical: the
behavior of the process in the future $X_{t+s}$ depends on the history to date
$\{X_r\}_{r\le t}$ only through the current state $X_t$, and is independent of
the prior history; that is, the dynamics of the Markov process are memoryless.
The assumption that $P_s f$ does not also depend on $t$ in the above expression
(the homogeneity property) indicates that the same dynamical mechanism is used
at each time.

A probability measure $\mu$ is called stationary or invariant if

$$\mu(P_t f) = \mu(f) \quad\text{for all } t\in\mathbb{R}_+,\ \text{bounded measurable } f.$$

To interpret this notion, suppose that $X_0\sim\mu$. Then

$$\mathbf{E}[f(X_t)] = \mathbf{E}[\mathbf{E}[f(X_t)|X_0]] = \mathbf{E}[P_t f(X_0)] = \mu(P_t f).$$

Thus if $\mu$ is stationary, then $\mathbf{E}[f(X_t)] = \mu(f)$ for every
$t\in\mathbb{R}_+$ and $f$: in particular, if the process is initially
distributed according to the stationary measure $X_0\sim\mu$, then the process
remains distributed according to the stationary measure $X_t\sim\mu$ for every
time $t$. In other words, stationary measures describe the "steady-state" or
"equilibrium" behavior of a Markov process.
Let us describe a few basic facts about the functions $P_t f$.

Lemma 2.7. Let $\mu$ be a stationary measure. Then the following hold for all
$p \ge 1$, $t,s\in\mathbb{R}_+$, $\alpha,\beta\in\mathbb{R}$, bounded
measurable functions $f,g$:

1.  $\|P_t f\|_{L^p(\mu)} \le \|f\|_{L^p(\mu)} := \mu(|f|^p)^{1/p}$ (contraction).

2.  $P_t(\alpha f + \beta g) = \alpha P_t f + \beta P_t g$ $\mu$-a.s. (linearity).

3.  $P_{t+s} f = P_t P_s f$ $\mu$-a.s. (semigroup property).

4.  $P_t 1 = 1$ $\mu$-a.s. (conservativeness).

In particular, $\{P_t\}_{t\in\mathbb{R}_+}$ defines a semigroup of linear
operators on $L^p(\mu)$.

Proof. Assume that $X_0\sim\mu$. To prove contraction, note that

$$\|P_t f\|_{L^p(\mu)}^p = \mathbf{E}[|\mathbf{E}[f(X_t)|X_0]|^p] \le \mathbf{E}[\mathbf{E}[|f(X_t)|^p|X_0]] = \|f\|_{L^p(\mu)}^p,$$

where we have used Jensen's inequality. Linearity follows similarly as

$$\mathbf{E}[\alpha f(X_t) + \beta g(X_t)|X_0] = \alpha\,\mathbf{E}[f(X_t)|X_0] + \beta\,\mathbf{E}[g(X_t)|X_0].$$

To prove the semigroup property, note that

$$\mathbf{E}[f(X_{t+s})|X_0] = \mathbf{E}[\mathbf{E}[f(X_{t+s})|\{X_r\}_{r\le t}]|X_0] = \mathbf{E}[P_s f(X_t)|X_0].$$

The last property is trivial. □

Remark 2.8. Let $\mu$ be a stationary measure. In view of Lemma 2.7, it is
easily seen that the definition and basic properties of $P_t f$ make sense not
only for bounded measurable functions $f$, but also for every $f\in L^1(\mu)$.
From now on, we will assume that $P_t f$ is defined in this manner for every
$f\in L^1(\mu)$.

As an illustration of these basic properties, let us prove the following
elementary observation. In the sequel, we will write
$\mathrm{Var}_\mu(f) := \mu(f^2) - \mu(f)^2$.

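Lemma 2.9. Let $\mu$ be a stationary measure. Then
$t\mapsto\mathrm{Var}_\mu(P_t f)$ is a decreasing function of time for every
function $f\in L^2(\mu)$.

Proof. Note that

$$\begin{aligned}
\mathrm{Var}_\mu(P_t f) &= \|P_t f - \mu f\|_{L^2(\mu)}^2 = \|P_t(f - \mu f)\|_{L^2(\mu)}^2 = \|P_{t-s}P_s(f - \mu f)\|_{L^2(\mu)}^2 \\
&\le \|P_s(f - \mu f)\|_{L^2(\mu)}^2 = \|P_s f - \mu f\|_{L^2(\mu)}^2 = \mathrm{Var}_\mu(P_s f)
\end{aligned}$$

for every $0 \le s \le t$. □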

We  now  turn  to  an  important  notion  for  Markov  processes  in  continuous 
time.  If  you  are  familiar  with  Markov  chains  in  discrete  time  with  a  finite  state 
space,  you  will  be  used  to  the  idea  that  the  dynamics  of  the  chain  is  defined  in 
terms  of  a  matrix  of  transition  probabilities.  This  matrix  describes  with  what 
probability  the  chain  moves  from  one  state  to  another  in  one  time  step,  and 
forms  the  basic  ingredient  in  the  analysis  of  the  behavior  of  Markov  chains. 
This  idea  does  not  make  sense  in  continuous  time,  as  a  Markov  process  evolves 
continuously  and  not  in  individual  steps.  Nonetheless,  there  is  an  object  that 
plays  the  analogous  role  in  continuous  time,  called  the  generator  of  a  Markov 
process.  We  will  first  describe  the  general  notion,  and  then  investigate  the 
finite  state  space  case  as  an  example  (in  which  case  the  generator  can  be 
interpreted  as  a  matrix  of  transition  rates  rather  than  probabilities). 

From now on, we will fix a Markov process with stationary measure $\mu$ and
consider $\{P_t\}_{t\in\mathbb{R}_+}$ as a semigroup of linear operators on $L^2(\mu)$.

Definition 2.10 (Generator). The generator $\mathcal{L}$ is defined as

$$\mathcal{L}f := \lim_{t\downarrow 0}\frac{P_t f - f}{t}$$

for every $f\in L^2(\mu)$ for which the above limit exists in $L^2(\mu)$. The
set of $f$ for which $\mathcal{L}f$ is defined is called the domain
$\mathrm{Dom}(\mathcal{L})$ of the generator, and $\mathcal{L}$ defines a
linear operator from $\mathrm{Dom}(\mathcal{L}) \subseteq L^2(\mu)$ to $L^2(\mu)$.

Remark 2.11 (Warning). For Markov processes whose sample paths are of
pure jump type (i.e., piecewise constant as a function of time) it is often the
case that $\mathrm{Dom}(\mathcal{L}) = L^2(\mu)$. This is the simplest setting
for the theory of Markov processes in continuous time, and here many
computations can be done without any technicalities. On the other hand, for
Markov processes with continuous sample paths (such as Brownian motion, for
example), it is an unfortunate fact of life that
$\mathrm{Dom}(\mathcal{L}) \subsetneq L^2(\mu)$. In this setting, a rigorous
treatment of semigroups, generators, and domains requires functional analytic
machinery that is not assumed as a prerequisite for this course. While
we should therefore ideally restrict attention to the pure jump case, many
important applications (for example, the proof of the Poincaré inequality for
Gaussian variables) will require the use of continuous Markov processes.

Fortunately, it turns out that domain problems prove to be of a purely
technical nature in all the applications that we will encounter: results that
we will derive for the case $\mathrm{Dom}(\mathcal{L}) = L^2(\mu)$ will be
directly applicable even when this condition fails. While a rigorous proof
would require checking carefully that no domain issues arise, addressing such
issues would take significant time and does not provide much insight into the
high-dimensional phenomena that are of interest in this course. As a
compromise, we will therefore generally ignore domain problems and assume
implicitly that $\mathrm{Dom}(\mathcal{L}) = L^2(\mu)$ when deriving general
results, while we will still apply these results in more general cases. The
interested reader should be aware when a shortcut is being taken, and refer to
the literature for a careful treatment of such technical issues.


How can one use the generator $\mathcal{L}$? We have defined the generator in
terms of the semigroup; however, it is in fact possible to define the semigroup
in terms of the generator, in analogy to the definition of a discrete Markov
chain in terms of its transition probability matrix. To see this, note that

$$\frac{d}{dt}P_t f = \lim_{\delta\downarrow 0}\frac{P_{t+\delta}f - P_t f}{\delta} = \lim_{\delta\downarrow 0} P_t\,\frac{P_\delta f - f}{\delta} = P_t\mathcal{L}f.$$

Thus $P_t$ can be recovered as the solution of the Kolmogorov equation

$$\frac{d}{dt}P_t f = P_t\mathcal{L}f, \qquad P_0 f = f.$$

This computation could also have been performed in a different order:

$$\frac{d}{dt}P_t f = \lim_{\delta\downarrow 0}\frac{P_{t+\delta}f - P_t f}{\delta} = \lim_{\delta\downarrow 0}\frac{P_\delta P_t f - P_t f}{\delta} = \mathcal{L}P_t f.$$

Thus we have demonstrated a basic property: the generator and the semigroup
commute, that is, $\mathcal{L}P_t = P_t\mathcal{L}$. [These statements are
entirely clear when $\mathrm{Dom}(\mathcal{L}) = L^2(\mu)$, and must be given a
careful interpretation otherwise.]

Example 2.12 (Finite state space). Let $(X_t)_{t\in\mathbb{R}_+}$ be a Markov
process with values in a finite state space $X_t\in\{1,\ldots,d\}$. Such
processes are typically described in terms of their transition rates
$\lambda_{ij} \ge 0$ for $i \ne j$:




$$\mathbf{P}[X_{t+\delta} = j|X_t = i] = \lambda_{ij}\delta + o(\delta) \quad\text{for } i \ne j.$$

Evidently, the transition rates $\lambda_{ij}$ describe the infinitesimal rate
of growth of the probability of jumping from state $i$ to state $j$
(informally, if $X_t = i$, then the probability that $X_{t+dt} = j$ is
$\lambda_{ij}\,dt$).

Let us organize the transition probabilities
$q_{t,ij} = \mathbf{P}[X_t = j|X_0 = i]$ and rates $\lambda_{ij}$ into matrices
$Q_t = (q_{t,ij})_{1\le i,j\le d}$ and $\Lambda = (\lambda_{ij})_{1\le i,j\le d}$,
respectively, where we define the diagonal entries of $\Lambda$ as
$\lambda_{ii} = -\sum_{j\ne i}\lambda_{ij} \le 0$. Then

$$\lim_{t\downarrow 0}\frac{q_{t,ij} - q_{0,ij}}{t} = \lambda_{ij}$$

for every $1 \le i,j \le d$ (the diagonal entries $\lambda_{ii}$ were chosen
precisely to enforce the law of total probability $\sum_j q_{t,ij} = 1$). In
particular, we have

$$\mathcal{L}f(i) = \lim_{t\downarrow 0}\sum_{j=1}^d \frac{q_{t,ij} - q_{0,ij}}{t}\,f(j) = \sum_{j=1}^d \lambda_{ij}f(j) = (\Lambda f)_i,$$

where we identify the function $f$ with the vector
$(f(1),\ldots,f(d))\in\mathbb{R}^d$. We therefore conclude that the generator
of a Markov process in a finite state space corresponds precisely to the matrix
of transition rates. The Kolmogorov equation now reduces to the matrix
differential equation

$$\frac{d}{dt}Q_t = Q_t\Lambda, \qquad Q_0 = I.$$

This differential equation is the basic tool for computing probabilities of
finite state space Markov processes. The solution is in fact easily obtained as

$$Q_t = e^{t\Lambda},$$

from which we readily see why $P_t$ and $\mathcal{L}$ must commute.
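
The finite state space picture is easy to experiment with numerically. The sketch below builds a rate matrix from hypothetical transition rates (the numbers are arbitrary choices, not from the text), computes $Q_t = e^{t\Lambda}$ by a truncated power series, and checks that the rows of $Q_t$ sum to one, that $Q_{s+t} = Q_s Q_t$, and that $\Lambda Q_t = Q_t\Lambda$. The helper names `mat_mul`, `mat_exp` and `max_abs_diff` are introduced here for illustration; a truncated series is adequate only for small matrices and moderate $t$.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    d = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]


def mat_exp(a, t, terms=60):
    """Approximate e^{t a} by the first `terms` terms of its power series."""
    d = len(a)
    result = [[float(i == j) for j in range(d)] for i in range(d)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (t a)^k / k!
        term = mat_mul(term, [[t * a[i][j] / k for j in range(d)] for i in range(d)])
        result = [[result[i][j] + term[i][j] for j in range(d)] for i in range(d)]
    return result


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical transition rates lambda_ij >= 0 for i != j (diagonal ignored).
rates = [[0.0, 1.0, 2.0],
         [0.5, 0.0, 0.5],
         [3.0, 1.0, 0.0]]
d = len(rates)
# Diagonal entries lambda_ii = -sum_{j != i} lambda_ij, so rows sum to zero.
Lam = [[rates[i][j] if i != j else -sum(rates[i][k] for k in range(d) if k != i)
        for j in range(d)] for i in range(d)]

Q_s, Q_t, Q_st = mat_exp(Lam, 0.3), mat_exp(Lam, 0.7), mat_exp(Lam, 1.0)

print("row sums of Q_t:", [sum(row) for row in Q_t])
print("semigroup defect:", max_abs_diff(Q_st, mat_mul(Q_s, Q_t)))
print("commutator defect:", max_abs_diff(mat_mul(Lam, Q_t), mat_mul(Q_t, Lam)))
```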

The  above  example  provides  some  intuition  for  the  notion  of  a  generator. 
Further  examples  of  Markov  semigroups  will  be  given  in  the  next  section. 

Remark 2.13. In analogy with the above example, we can formally express the
relation between the semigroup and generator of a Markov process as
$P_t = e^{t\mathcal{L}}$. This expression is readily made precise in the case
$\mathrm{Dom}(\mathcal{L}) = L^2(\mu)$ by interpreting $e^{t\mathcal{L}}$ as a
power series. While this does not work in the case
$\mathrm{Dom}(\mathcal{L}) \subsetneq L^2(\mu)$, the intuition extends also to
this setting; however, in this case the meaning of the exponential function
must be carefully defined.

We conclude this section by introducing one more fundamental idea in
the theory of Markov processes. Recall that we have defined the semigroup $P_t$
as a family of linear operators on $L^2(\mu)$. The latter is a Hilbert space,
and we denote its inner product as $\langle f,g\rangle_\mu := \mu(fg)$ (so that
$\|f\|_{L^2(\mu)}^2 = \langle f,f\rangle_\mu$).

Definition 2.14 (Reversibility). The Markov semigroup $P_t$ with stationary
measure $\mu$ is called reversible if
$\langle f,P_t g\rangle_\mu = \langle P_t f,g\rangle_\mu$ for every
$f,g\in L^2(\mu)$.

Thus the Markov process is reversible if the operators $P_t$ are self-adjoint
on $L^2(\mu)$. Equivalently, as $P_t = e^{t\mathcal{L}}$, the Markov process is
reversible if its generator $\mathcal{L}$ is self-adjoint. The reversibility
property has a probabilistic interpretation: if the Markov process is
reversible, then (assuming $X_0\sim\mu$)

$$\begin{aligned}
\langle P_t f,g\rangle_\mu = \langle f,P_t g\rangle_\mu &= \mathbf{E}[f(X_0)\,\mathbf{E}[g(X_t)|X_0]] \\
&= \mathbf{E}[f(X_0)g(X_t)] = \mathbf{E}[\mathbf{E}[f(X_0)|X_t]\,g(X_t)]
\end{aligned}$$

for every $f,g\in L^2(\mu)$, so that in particular

$$P_t f(x) := \mathbf{E}[f(X_t)|X_0 = x] = \mathbf{E}[f(X_0)|X_t = x].$$

This implies that when the Markov process $(X_t)_{t\in[0,a]}$ is viewed
backwards in time $(X_{a-t})_{t\in[0,a]}$, it has the same law: that is, the
law of the Markov process is invariant under time reversal; hence the name
reversibility.

We  will  see  in  the  following  section  that  reversible  Markov  processes  are 
the most natural objects connected to Poincaré inequalities (and to other
functional  inequalities  that  we  will  encounter  in  later  chapters).  However, 
the  notion  of  time  reversal  will  not  play  any  role  in  our  proofs.  Rather,  for 
reasons  that  will  become  evident  in  the  next  section,  the  self-adjointness  of 
the  generator  will  allow  us  to  obtain  a  very  complete  characterization  of 
exponential  convergence  of  the  Markov  semigroup  to  the  stationary  measure. 

Example 2.15 (Finite state space continued). In the setting of Example 2.12,
it is evident that the Markov process is reversible if and only if

$$\sum_{i,j=1}^d \mu_i f_i\Lambda_{ij}g_j = \sum_{i,j=1}^d \mu_j g_j\Lambda_{ji}f_i$$

for all $f,g\in\mathbb{R}^d$, or equivalently

$$\mu_i\Lambda_{ij} = \mu_j\Lambda_{ji} \quad\text{for all } i,j\in\{1,\ldots,d\},$$

where $\mu$ denotes the stationary measure of the Markov process. The latter
condition is often called "detailed balance" in the physics literature.
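
The equivalence with detailed balance can be checked by testing self-adjointness of $\Lambda$ on indicator vectors. The following short derivation is a sketch, with $e_i$ denoting the $i$-th standard basis vector of $\mathbb{R}^d$ (notation introduced here); the converse direction follows by bilinearity.

```latex
\begin{aligned}
\langle f,\Lambda g\rangle_\mu &= \sum_{i,j=1}^d \mu_i f_i\,\Lambda_{ij}\,g_j,
&\qquad
\langle \Lambda f, g\rangle_\mu &= \sum_{i,j=1}^d \mu_j\,\Lambda_{ji}\,f_i\,g_j, \\
% Taking f = e_i and g = e_j isolates a single term on each side:
\langle e_i,\Lambda e_j\rangle_\mu &= \mu_i\Lambda_{ij},
&\qquad
\langle \Lambda e_i, e_j\rangle_\mu &= \mu_j\Lambda_{ji}.
\end{aligned}
```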

Problems 

2.6 (Some elementary identities). Let $P_t$ be a Markov semigroup with
generator $\mathcal{L}$ and stationary measure $\mu$. Prove the following
elementary facts:

a.  Show that $\mu(\mathcal{L}f) = 0$ for every $f\in\mathrm{Dom}(\mathcal{L})$.

b.  If $\phi : \mathbb{R}\to\mathbb{R}$ is convex, then
$P_t\phi(f) \ge \phi(P_t f)$ when $f, \phi(f)\in L^2(\mu)$.

c.  If $\phi : \mathbb{R}\to\mathbb{R}$ is convex, then
$\mathcal{L}\phi(f) \ge \phi'(f)\mathcal{L}f$ when
$f, \phi(f)\in\mathrm{Dom}(\mathcal{L})$.

d.  Let $f\in\mathrm{Dom}(\mathcal{L})$. Show that the following process is a martingale:

$$M_t^f := f(X_t) - \int_0^t \mathcal{L}f(X_s)\,ds.$$

2.3  Poincaré inequalities

Throughout this section, we fix a Markov semigroup $P_t$ with generator
$\mathcal{L}$ and stationary measure $\mu$. As was discussed in the previous
section, the stationary measure describes the "steady-state" behavior of the
Markov process: that is, if $X_0\sim\mu$, then $X_t\sim\mu$ for all times $t$.
It is natural to ask whether the Markov process will in fact eventually end up
in its steady state even if it is not started there, but rather at some fixed
initial condition $X_0 = x$: that is, is it true that

$$\mathbf{E}[f(X_t)|X_0 = x] \to \mu f \quad\text{as } t\to\infty?$$

If this is the case, the Markov process is said to be ergodic. There are various
different notions of ergodicity in the theory of Markov processes; as we are
working in $L^2(\mu)$, the following will be natural for our purposes.

Definition 2.16 (Ergodicity). The Markov semigroup is called ergodic if
$P_t f\to\mu f$ in $L^2(\mu)$ as $t\to\infty$ for every $f\in L^2(\mu)$.

Recall that a Poincaré inequality for $\mu$ is, informally, of the form

$$\text{“ variance}(f) \lesssim \mathbf{E}[\,\|\text{gradient}(f)\|^2\,]\text{. ”}$$

At first sight, such an inequality has nothing to do with Markov processes.
Remarkably, however, the validity of a Poincaré inequality for $\mu$ turns out
to be intimately related to the rate of convergence of an ergodic Markov
process for which $\mu$ is the stationary distribution. Still informally, we
have the following:

A measure $\mu$ satisfies a Poincaré inequality for a certain notion of
"gradient" if and only if an ergodic Markov semigroup associated to
this "gradient" converges exponentially fast to $\mu$.

The following definition and result make this principle precise.

Definition 2.17 (Dirichlet form). Given a Markov process with generator
$\mathcal{L}$ and stationary measure $\mu$, the corresponding Dirichlet form is
defined as

$$\mathcal{E}(f,g) = -\langle f,\mathcal{L}g\rangle_\mu.$$

Theorem 2.18 (Poincaré inequality). Let $P_t$ be a reversible ergodic Markov
semigroup with stationary measure $\mu$. The following are equivalent given $c \ge 0$:

1. $\mathrm{Var}_\mu[f] \le c\,\mathcal{E}(f,f)$ for all $f$ (Poincaré inequality).

2. $\|P_tf - \mu f\|_{L^2(\mu)} \le e^{-t/c}\,\|f - \mu f\|_{L^2(\mu)}$ for all $f, t$.

3. $\mathcal{E}(P_tf, P_tf) \le e^{-2t/c}\,\mathcal{E}(f,f)$ for all $f, t$.

4. For every $f$ there exists $\kappa(f)$ such that $\|P_tf - \mu f\|_{L^2(\mu)} \le \kappa(f)\,e^{-t/c}$.

5. For every $f$ there exists $\kappa(f)$ such that $\mathcal{E}(P_tf, P_tf) \le \kappa(f)\,e^{-2t/c}$.

Remark 2.19. As will be seen in the proof of this theorem, the implications
$5 \Leftarrow 3 \Rightarrow 1 \Leftrightarrow 2 \Rightarrow 4$ remain valid even when $P_t$ is not reversible. The
remaining implications $5 \Rightarrow 3$, $4 \Rightarrow 2$ and $2 \Rightarrow 3$ require reversibility.
At this point, the interpretation of Theorem 2.18 is probably far from
clear. There are several questions we must address:

• Why do we call $\mathrm{Var}_\mu[f] \le c\,\mathcal{E}(f,f)$ a Poincaré inequality? In what sense
can $\mathcal{E}(f,f)$ be interpreted as an "expected square gradient" of $f$?

• Is there any relation between Theorem 2.18 and the discrete Poincaré
inequalities that we already derived in section 2.1?

• Why should we expect any connection between Poincaré inequalities and
Markov processes in the first place?

The quickest way to get a feeling for the first two questions is to consider some
illuminating examples. To this end, we will devote the remainder of this section
to developing two applications of Theorem 2.18. First, we will prove one of
the most important examples of a Poincaré inequality, the Gaussian Poincaré
inequality, using the machinery of Theorem 2.18. Along the way, we will
introduce an important Markov process, the Ornstein-Uhlenbeck process, that
will appear again in later chapters. Second, we will show that the tensorization
inequality that we already proved in Theorem 2.3 is itself a special case
of Theorem 2.18; this again requires the introduction of a suitable Markov
process. Of course, this is not the easiest proof of the tensorization inequality,
and it is not suggested that Theorem 2.18 should be used when an easier proof
is available. Rather, this example highlights that Theorem 2.18 is not distinct
from the inequalities that we developed in section 2.1, but rather provides a
unified framework for all the Poincaré inequalities that we encounter.

The proof of Theorem 2.18 will be postponed to the next section. When
we begin developing the proof, it will quickly become apparent why Poincaré
inequalities are connected to Markov processes, and why $\mathrm{Var}_\mu[f] \le c\,\mathcal{E}(f,f)$ is
the "right" notion of a Poincaré inequality. The ideas used in the proof are of
interest in their own right and can be used to prove other interesting results.

Remark 2.20. The properties 2-5 of Theorem 2.18 should all be viewed as
different notions of exponential convergence of the Markov semigroup $P_t$ to
the stationary measure $\mu$. Properties 2 and 4 measure directly the rate of
convergence of $P_tf$ to $\mu f$ in $L^2(\mu)$ (cf. Definition 2.16). On the other hand,
properties 3 and 5 measure the rate of convergence of the "gradient" of $P_tf$
to zero. As ergodicity implies that $P_tf(x)$ becomes insensitive to $x$ as $t \to \infty$
(that is, the Markov process "forgets" its initial condition), the latter is also
a natural formulation of the ergodicity property. The properties 4 and 5 are
often easier to prove than properties 2 and 3, as they only require control of
the rate of convergence and not of the constant in the inequality.

Remark 2.21. Let $\mu$ be a measure for which we would like to prove a Poincaré
inequality. In order to apply Theorem 2.18, we must construct a suitable
Markov process for which $\mu$ is the stationary measure. There is not a unique
way to do this: there are many different Markov processes that admit the same
stationary measure $\mu$. However, each Markov process gives rise to a different
Dirichlet form $\mathcal{E}(f,f)$, and thus to a Poincaré inequality for $\mu$ with respect to a
different notion of gradient! By choosing different Markov processes, Theorem
2.18 therefore provides us with a flexible mechanism to prove a whole family
of different Poincaré inequalities for the same distribution $\mu$.

Conversely, Theorem 2.18 can be used in the opposite direction. Suppose
that we are interested in ergodicity of a given Markov process with stationary
measure $\mu$. If we can prove, by some means, that $\mu$ satisfies a Poincaré inequality
with respect to the Dirichlet form induced by the given Markov process,
then we have immediately established exponential convergence of the Markov
process to its stationary measure. This is important in many applications,
including nonequilibrium statistical mechanics and the analysis of Markov
Chain Monte Carlo algorithms for sampling from the stationary measure $\mu$.

We now turn to the examples announced above. We begin with an important
inequality that has many applications: the Gaussian Poincaré inequality.

Example 2.22 (Gaussian Poincaré inequality). Our aim is to obtain a Poincaré
inequality for the standard Gaussian distribution $\mu = N(0,1)$ in one dimension
(we can subsequently use tensorization to extend to higher dimensions).
Of course, there is no unique Poincaré inequality: for example, the trivial
Lemma 2.1 applies to the Gaussian distribution as it does to any other. However,
we will see that for the Gaussian, we can obtain a nontrivial Poincaré
inequality with respect to the classical calculus notion of gradient. This
inequality is usually referred to as the Gaussian Poincaré inequality.

By Theorem 2.18, the key to obtaining a Poincaré inequality for $\mu$ with a
specific notion of gradient is to construct a Markov process whose Dirichlet
form corresponds to the desired notion of gradient and for which $\mu$ is the
stationary distribution. For the Gaussian distribution, the appropriate Markov
process is the Ornstein-Uhlenbeck process, which is one of the most important
tools in the study of Gaussian distributions and which we will encounter again
in later chapters. Given a standard Brownian motion $(W_t)_{t\in\mathbb{R}_+}$, the
Ornstein-Uhlenbeck process can be defined as

$$X_t = e^{-t}X_0 + e^{-t}W_{e^{2t}-1}, \qquad X_0 \perp\!\!\!\perp W.$$

It is evident that if $X_0 \sim N(0,1)$, then $X_t \sim N(0,1)$ for all $t \in \mathbb{R}_+$. Let us
collect some basic properties of the Ornstein-Uhlenbeck process.

Lemma 2.23 (Ornstein-Uhlenbeck process). The process $(X_t)_{t\in\mathbb{R}_+}$
defined above is a Markov process with semigroup

$$P_tf(x) = \mathbf{E}[f(e^{-t}x + \sqrt{1-e^{-2t}}\,\xi)], \qquad \xi \sim N(0,1).$$

The process admits $\mu = N(0,1)$ as its stationary measure and is ergodic.
Moreover, its generator and Dirichlet form are given by

$$\mathcal{L}f(x) = -xf'(x) + f''(x), \qquad \mathcal{E}(f,g) = \langle f', g'\rangle_\mu.$$

In particular, the Ornstein-Uhlenbeck process is reversible.
Before we can prove this result, we need an elementary property of the
Gaussian distribution: the Gaussian integration by parts formula.

Lemma 2.24 (Gaussian integration by parts). If $\xi \sim N(0,1)$, then

$$\mathbf{E}[\xi f(\xi)] = \mathbf{E}[f'(\xi)].$$

Proof. If $f$ is smooth with compact support, then we have

$$\int_{-\infty}^{\infty} f'(x)\,\frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx = -\int_{-\infty}^{\infty} f(x)\,\frac{d}{dx}\frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx$$

by integration by parts, and the result follows readily. We can now extend to
any $f$ with $\xi f(\xi), f'(\xi) \in L^1$ by a routine approximation argument. □
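
The identity of Lemma 2.24 is easy to check numerically. The following is a minimal sketch, not part of the original notes: the helper `gaussian_expectation`, the integration window and the test function $f = \sin$ are all illustrative assumptions. It approximates both sides by a trapezoid rule against the standard Gaussian density.

```python
import math

def gaussian_expectation(h, lo=-12.0, hi=12.0, steps=24000):
    """Approximate E[h(xi)] for xi ~ N(0,1) with the composite trapezoid rule.

    The window [lo, hi] and the step count are illustrative choices; the
    Gaussian tails beyond |x| = 12 are negligible for moderately growing h.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        weight = 0.5 if k == 0 or k == steps else 1.0
        total += weight * h(x) * math.exp(-0.5 * x * x)
    return total * dx / math.sqrt(2.0 * math.pi)

f = math.sin        # hypothetical test function
f_prime = math.cos  # its derivative

lhs = gaussian_expectation(lambda x: x * f(x))  # E[xi f(xi)]
rhs = gaussian_expectation(f_prime)             # E[f'(xi)]
print(lhs, rhs)
```

For this choice of $f$ both sides can also be computed by hand, since $\mathbf{E}[\cos \xi] = e^{-1/2}$, which gives an independent check on the quadrature.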

Proof (Lemma 2.23). Let $s \le t$. By the definition of $X_t$, we have

$$X_t = e^{-(t-s)}X_s + e^{-t}(W_{e^{2t}-1} - W_{e^{2s}-1}) = e^{-(t-s)}X_s + \sqrt{1-e^{-2(t-s)}}\,\xi,$$

where $\xi = (W_{e^{2t}-1} - W_{e^{2s}-1})/\sqrt{e^{2t}-e^{2s}} \sim N(0,1)$ is independent of $\{X_r\}_{r\le s}$.
It follows immediately that we can write

$$\mathbf{E}[f(X_t)\,|\,\{X_r\}_{r\le s}] = P_{t-s}f(X_s),$$

with $P_tf$ as defined in the statement of the Lemma. In particular, $(X_t)_{t\ge 0}$
satisfies the Markov property. Moreover, it is evident by inspection that $\mu =
N(0,1)$ is stationary and that the semigroup is ergodic.

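With the semigroup in hand, we can now compute the generator and the
Dirichlet form. To compute the generator, note that

$$\frac{d}{dt}P_tf(x) = \mathbf{E}\Big[f'(e^{-t}x + \sqrt{1-e^{-2t}}\,\xi)\Big\{\frac{e^{-2t}\,\xi}{\sqrt{1-e^{-2t}}} - e^{-t}x\Big\}\Big]$$
$$= \mathbf{E}[-e^{-t}x\,f'(e^{-t}x + \sqrt{1-e^{-2t}}\,\xi) + e^{-2t}f''(e^{-t}x + \sqrt{1-e^{-2t}}\,\xi)],$$

where we have used Lemma 2.24 in the second line. We therefore have

$$\frac{d}{dt}P_tf(x) = \Big(-x\frac{d}{dx} + \frac{d^2}{dx^2}\Big)P_tf(x).$$

Letting $t \downarrow 0$ yields the expression for $\mathcal{L}$ given in the statement of the Lemma.
To compute the Dirichlet form, it suffices to note that

$$\mathcal{E}(f,g) = -\langle f, \mathcal{L}g\rangle_\mu = \mathbf{E}[f(\xi)\{\xi g'(\xi) - g''(\xi)\}] = \mathbf{E}[f'(\xi)g'(\xi)],$$

where we have used Lemma 2.24 once more. Finally, $\langle f, \mathcal{L}g\rangle_\mu = \langle \mathcal{L}f, g\rangle_\mu$ as
$\mathcal{E}(f,g)$ is symmetric, so the Ornstein-Uhlenbeck process is reversible. □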
Remark 2.25. Our definition of the Ornstein-Uhlenbeck process may seem
a little mysterious. Perhaps a more intuitive definition of the Ornstein-Uhlenbeck
process is as the solution of the stochastic differential equation

$$dX_t = -X_t\,dt + \sqrt{2}\,dB_t,$$

where $(B_t)_{t\in\mathbb{R}_+}$ is standard Brownian motion: that is, the Ornstein-Uhlenbeck
process is obtained by subjecting a Brownian motion to linear forcing that
keeps it from going off to infinity. While this approach is more insightful and
is more readily generalized to other distributions, our elementary approach
has the advantage that it avoids the use of stochastic calculus.
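
To see that the two descriptions agree, one can simulate the stochastic differential equation of Remark 2.25 and compare with the transition law read off from Lemma 2.23, under which $X_t$ given $X_0 = x$ is $N(e^{-t}x,\, 1 - e^{-2t})$. The sketch below uses a plain Euler-Maruyama discretization, which is an assumption of this illustration rather than a method used in the notes; the step size, horizon, initial point and number of paths are arbitrary choices, and the scheme carries a small discretization bias.

```python
import math
import random

def euler_maruyama_ou(x0, t_final, dt, rng):
    """Simulate dX = -X dt + sqrt(2) dB on [0, t_final] with step dt; return X at t_final."""
    x = x0
    steps = int(round(t_final / dt))
    for _ in range(steps):
        # Drift pulls toward 0; the Brownian increment over dt has variance dt.
        x += -x * dt + math.sqrt(2.0 * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)  # fixed seed so the run is reproducible
x0, t_final, dt, n_paths = 3.0, 4.0, 0.01, 2000  # illustrative parameters
samples = [euler_maruyama_ou(x0, t_final, dt, rng) for _ in range(n_paths)]

sample_mean = sum(samples) / n_paths
sample_var = sum((s - sample_mean) ** 2 for s in samples) / (n_paths - 1)

# Transition law implied by the semigroup formula of Lemma 2.23.
exact_mean = math.exp(-t_final) * x0
exact_var = 1.0 - math.exp(-2.0 * t_final)

print(f"mean: simulated {sample_mean:.3f}, exact {exact_mean:.3f}")
print(f"var:  simulated {sample_var:.3f}, exact {exact_var:.3f}")
```

Even though the process is started far from equilibrium, by time $t = 4$ its law is already close to $N(0,1)$, which is the "forgetting" of the initial condition described in Remark 2.20.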

From Lemma 2.23, it follows immediately that

$$\mathcal{E}(f,f) = \|f'\|^2_{L^2(\mu)} = \mathbf{E}[\{f'(\xi)\}^2], \qquad \xi \sim N(0,1).$$

Thus the Dirichlet form for the Ornstein-Uhlenbeck process is precisely the
expected square gradient for the classical calculus notion of gradient! Thus an
inequality of the form $\mathrm{Var}_\mu[f] \le c\,\mathcal{E}(f,f)$ is indeed a Poincaré inequality in the
most classical sense. By Theorem 2.18, proving such an inequality is equivalent
to proving exponential ergodicity of the Ornstein-Uhlenbeck process. With
Lemma 2.23 in hand, this is a remarkably easy exercise.

Theorem 2.26. Let $\mu = N(0,1)$. Then $\mathrm{Var}_\mu[f] \le \|f'\|^2_{L^2(\mu)}$.

This is the Gaussian Poincaré inequality in one dimension.

Proof. It follows immediately from the expression for $P_tf$ in Lemma 2.23 that

$$\frac{d}{dx}P_tf(x) = e^{-t}P_tf'(x).$$

Thus

$$\mathcal{E}(P_tf,P_tf) = \|(P_tf)'\|^2_{L^2(\mu)} = e^{-2t}\|P_tf'\|^2_{L^2(\mu)} \le e^{-2t}\|f'\|^2_{L^2(\mu)} = e^{-2t}\mathcal{E}(f,f).$$

The result follows by the implication $3 \Rightarrow 1$ of Theorem 2.18 (with $c = 1$). □
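
As a sanity check on Theorem 2.26, the two sides of the Gaussian Poincaré inequality can be compared numerically for a few test functions. This is a rough sketch under stated assumptions: the quadrature helper `gauss_mean`, the central-difference derivative and the particular test functions are illustrative choices, not part of the notes.

```python
import math

def gauss_mean(h, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoid-rule approximation of E[h(xi)] for xi ~ N(0,1)."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        w = 0.5 if k == 0 or k == steps else 1.0
        total += w * h(x) * math.exp(-0.5 * x * x)
    return total * dx / math.sqrt(2.0 * math.pi)

def variance(f):
    """Var_mu[f] under the standard Gaussian measure."""
    m = gauss_mean(f)
    return gauss_mean(lambda x: (f(x) - m) ** 2)

def dirichlet(f, eps=1e-5):
    """E(f, f) = E[f'(xi)^2], with f' replaced by a central difference."""
    return gauss_mean(lambda x: ((f(x + eps) - f(x - eps)) / (2.0 * eps)) ** 2)

# Hypothetical test functions; any reasonably smooth f could be used here.
test_functions = {
    "x": lambda x: x,
    "sin(x)": math.sin,
    "x^2": lambda x: x * x,
    "tanh(3x)": lambda x: math.tanh(3.0 * x),
}

results = {name: (variance(f), dirichlet(f)) for name, f in test_functions.items()}
for name, (var, energy) in results.items():
    print(f"{name:>8}: Var = {var:.4f}   E[f'^2] = {energy:.4f}")
```

The linear function $f(x) = x$ attains equality, which shows that the constant in Theorem 2.26 cannot be improved.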

Remark 2.27. Let us emphasize once more that there is nothing special about
the Ornstein-Uhlenbeck process per se in the context of Theorem 2.18: there
are many Markov processes for which $\mu = N(0,1)$ is stationary. Different
Markov processes could be used to prove different Poincaré inequalities for
the Gaussian distribution for different notions of gradient. What singles out
the Ornstein-Uhlenbeck process is that its Dirichlet form $\mathcal{E}(f,f) = \|f'\|^2_{L^2(\mu)}$
is precisely given in terms of the classical calculus notion of gradient, which
provides a particularly useful tool in many applications.
Having proved the Gaussian Poincaré inequality in one dimension, we
immediately obtain an $n$-dimensional inequality by tensorization. As this is a
very useful inequality in applications, let us state it as a theorem. [We could
also have proved this directly without tensorization using an $n$-dimensional
Ornstein-Uhlenbeck process, but this does not add much additional insight.]

Corollary 2.28 (Gaussian Poincaré inequality). Let $X_1,\ldots,X_n$ be
independent Gaussian random variables with zero mean and unit variance. Then

$$\mathrm{Var}[f(X_1,\ldots,X_n)] \le \mathbf{E}[\|\nabla f(X_1,\ldots,X_n)\|^2].$$

We now turn to our second example: we will show that the tensorization
inequality of Theorem 2.3 is a special case of Theorem 2.18. Thus the
connection between Poincaré inequalities and Markov semigroups captures in a
unified framework all of the inequalities that we have seen so far.

Example 2.29 (Tensorization revisited). Let $\mu = \mu_1 \otimes \cdots \otimes \mu_n$ be any product
measure. We aim to investigate the tensorization inequality of Theorem 2.3
from the viewpoint of Theorem 2.18. To this end, we begin by constructing a
Markov process for which $\mu$ is stationary and whose Dirichlet form corresponds
to the right-hand side of the tensorization inequality.

Let $X_t = (X_t^1,\ldots,X_t^n)_{t\in\mathbb{R}_+}$ be a random process constructed as follows.
To each coordinate $i = 1,\ldots,n$, we attach an independent Poisson process
$N_t^i$ with unit rate. The Poisson process should be viewed as a random clock
attached to each coordinate that "ticks" whenever $N_t^i$ jumps. The process
$(X_t)_{t\in\mathbb{R}_+}$ is now constructed by the following mechanism:

• Draw $X_0 \sim \mu$ independently from the Poisson process $N = (N^1,\ldots,N^n)$.

• Each time $N_t^i$ jumps for some $i$, replace the current value of $X_t^i$ by an
independent sample from $\mu_i$ while keeping the remaining coordinates fixed.

As the Poisson process has independent increments, it is easily verified that
$(X_t)_{t\in\mathbb{R}_+}$ satisfies the Markov property and that $\mu$ is stationary.

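Let us now compute the semigroup of $(X_t)_{t\in\mathbb{R}_+}$. By construction,

$$P_tf(x) = \mathbf{E}[f(X_t)\,|\,X_0 = x]$$
$$= \sum_{I\subseteq\{1,\ldots,n\}} \mathbf{P}[N_t^i > 0 \text{ for } i \in I,\ N_t^i = 0 \text{ for } i \notin I] \int f(x_1,\ldots,x_n) \prod_{i\in I}\mu_i(dx_i)$$
$$= \sum_{I\subseteq\{1,\ldots,n\}} (1-e^{-t})^{|I|}\,e^{-t(n-|I|)} \int f(x_1,\ldots,x_n) \prod_{i\in I}\mu_i(dx_i).$$

In particular, we can compute the generator as

$$\mathcal{L}f = \lim_{t\downarrow 0}\frac{P_tf - f}{t} = -\sum_{i=1}^n \delta_i f,$$

where we have introduced the notation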




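$$\delta_i f(x) := f(x) - \int f(x_1,\ldots,x_{i-1},z,x_{i+1},\ldots,x_n)\,\mu_i(dz).$$

Finally, let us compute the Dirichlet form

$$\mathcal{E}(f,g) = -\langle f, \mathcal{L}g\rangle_\mu = \sum_{i=1}^n \int f\,\delta_i g\,d\mu = \sum_{i=1}^n \int \delta_i f\,\delta_i g\,d\mu,$$

where we have used that $\int h\,\delta_i g\,d\mu = 0$ if $h(x)$ does not depend on $x_i$. As
$\mathcal{E}(f,g)$ is symmetric, it follows that our Markov process is reversible.

Now note that

$$\mathcal{E}(f,f) = \sum_{i=1}^n \int (\delta_i f)^2\,d\mu = \sum_{i=1}^n \int \mathrm{Var}_i f\,d\mu.$$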

Thus the tensorization inequality of Theorem 2.3 can be expressed as

$$\mathrm{Var}_\mu[f] \le \mathcal{E}(f,f),$$

and we therefore conclude that tensorization is nothing but a special case
of Theorem 2.18. In fact, given that we already proved the tensorization
inequality, we could now invoke Theorem 2.18 to conclude immediately that our
Markov process is exponentially ergodic in the sense that

$$\|P_tf - \mu f\|_{L^2(\mu)} \le e^{-t}\,\|f - \mu f\|_{L^2(\mu)}.$$

Conversely, if we can give a direct proof of exponential ergodicity of our
Markov process, then we obtain by Theorem 2.18 an alternative proof of the
tensorization inequality. Let us provide such a proof for the sake of illustration.
From the explicit formula for $P_tf$ above, it follows that

$$\delta_i P_tf = e^{-t} \sum_{I \not\ni i} (1-e^{-t})^{|I|}\,e^{-t(n-1-|I|)} \int \delta_i f(x_1,\ldots,x_n) \prod_{j\in I}\mu_j(dx_j).$$

Evidently each term in the sum has $L^2(\mu)$-norm at most $\|\delta_i f\|_{L^2(\mu)}$, so

$$\mathcal{E}(P_tf,P_tf) = \sum_{i=1}^n \|\delta_i P_tf\|^2_{L^2(\mu)} \le \kappa(f)\,e^{-2t}$$

for some $\kappa(f) < \infty$ for every $f \in L^2(\mu)$. The tensorization inequality of
Theorem 2.3 therefore follows from the implication $5 \Rightarrow 1$ of Theorem 2.18.
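
The objects in Example 2.29 can be computed exactly on a small discrete product space, which makes the statements above easy to test. The following sketch assumes a hypothetical product of three Bernoulli measures on $\{0,1\}^3$ and an arbitrary test function; functions are stored as tables indexed by states, and all names are illustrative.

```python
import itertools
import math

p = [0.2, 0.5, 0.7]  # hypothetical marginals: mu_i = Bernoulli(p[i])
n = len(p)
states = list(itertools.product([0, 1], repeat=n))

def mu(x):
    """Product measure mu = mu_1 x ... x mu_n evaluated at the state x."""
    return math.prod(p[i] if x[i] else 1.0 - p[i] for i in range(n))

def integrate_out(f, coords):
    """Table of x -> integral of f over the coordinates listed in coords."""
    g = dict(f)
    for i in coords:
        h = {}
        for x in states:
            x1 = x[:i] + (1,) + x[i + 1:]
            x0 = x[:i] + (0,) + x[i + 1:]
            h[x] = p[i] * g[x1] + (1.0 - p[i]) * g[x0]
        g = h
    return g

def semigroup(f, t):
    """P_t f via the explicit sum over the set I of coordinates whose clock has ticked."""
    out = {x: 0.0 for x in states}
    for r in range(n + 1):
        w = (1.0 - math.exp(-t)) ** r * math.exp(-t * (n - r))
        for I in itertools.combinations(range(n), r):
            g = integrate_out(f, I)
            for x in states:
                out[x] += w * g[x]
    return out

def variance(f):
    m = sum(mu(x) * f[x] for x in states)
    return sum(mu(x) * (f[x] - m) ** 2 for x in states)

def dirichlet(f):
    """E(f, f) = sum_i integral of (delta_i f)^2 d mu."""
    total = 0.0
    for i in range(n):
        g = integrate_out(f, [i])
        total += sum(mu(x) * (f[x] - g[x]) ** 2 for x in states)
    return total

# Arbitrary test function on {0,1}^3.
f = {x: float(x[0] + 2 * x[1] * x[2] - x[0] * x[2]) for x in states}

t = 0.5
f_t = semigroup(f, t)
print("Var[f]       =", variance(f), "  E(f,f) =", dirichlet(f))
print("E(P_t f,P_t f) =", dirichlet(f_t), "  e^{-2t} E(f,f) =", math.exp(-2 * t) * dirichlet(f))
```

The printed quantities illustrate both the tensorization inequality $\mathrm{Var}_\mu[f] \le \mathcal{E}(f,f)$ and the exponential decay of the Dirichlet form along the semigroup.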


Problems 

2.7 (Carré du champ). We have interpreted the Dirichlet form $\mathcal{E}(f,f)$ as
a general notion of "expected square gradient" that arises in the study of
Poincaré inequalities. There is an analogous quantity $\Gamma(f,f)$ that plays the
role of "square gradient" in this setting (without the expectation). In good
probabilistic tradition, it is universally known by its French name carré du
champ (literally, "square of the field"). The carré du champ is defined as

$$\Gamma(f,g) := \tfrac{1}{2}\{\mathcal{L}(fg) - f\mathcal{L}g - g\mathcal{L}f\}$$

in terms of the generator $\mathcal{L}$ of a Markov process with stationary measure $\mu$.

a. Show that $\mathcal{E}(f,f) = \int \Gamma(f,f)\,d\mu$, and that $\mathcal{E}(f,g) = \int \Gamma(f,g)\,d\mu$ if the
Markov process is in addition reversible.

b. Show that $\Gamma(f,f) \ge 0$, so it can indeed be interpreted as a square.

Hint: use $P_t(f^2) \ge (P_tf)^2$ and the definition of $\mathcal{L}$.

c. Prove the Cauchy-Schwarz inequality $\Gamma(f,g)^2 \le \Gamma(f,f)\,\Gamma(g,g)$.

Hint: use that $\Gamma(f + tg, f + tg) \ge 0$ for all $t \in \mathbb{R}$.

d. Compute the carré du champ in the various examples of Poincaré inequalities
encountered in this chapter, and convince yourself that it should indeed
be interpreted as the appropriate notion of "square gradient" in each case.

2.8 (Gaussian Poincaré inequality). The goal of this problem is to develop
some simple consequences and insights for the Gaussian Poincaré inequality.

a. Let $X_1,\ldots,X_n$ be i.i.d. standard Gaussians. Show that if $f$ is $L$-Lipschitz,
that is, $|f(x) - f(y)| \le L\|x - y\|$, then $\mathrm{Var}[f(X_1,\ldots,X_n)] \le L^2$.

Remark. The power of the above inequality is its dimension-free nature: it
depends only on the degree of smoothness of $f$ and not on the dimension $n$.

b. Let $X \sim N(0,\Sigma)$ be an $n$-dimensional centered Gaussian vector with arbitrary
covariance matrix $\Sigma$. Prove the following useful inequality:

$$\mathrm{Var}\Big[\max_{i=1,\ldots,n} X_i\Big] \le \max_{i=1,\ldots,n} \mathrm{Var}[X_i].$$

Hint: write $X = \Sigma^{1/2}Y$ where $Y_1,\ldots,Y_n$ are i.i.d. standard Gaussians.

c. By a miracle, it is possible to derive the Gaussian Poincaré inequality from
the bounded difference inequality of Corollary 2.4. To this end, let $\varepsilon_{ij}$ be
i.i.d. symmetric Bernoulli variables. By the central limit theorem,

$$f\Big(\frac{1}{\sqrt{k}}\sum_{j=1}^k \varepsilon_{1j},\ \ldots,\ \frac{1}{\sqrt{k}}\sum_{j=1}^k \varepsilon_{nj}\Big) \Longrightarrow f(X_1,\ldots,X_n)$$

in distribution as $k \to \infty$ when $f$ is a bounded continuous function and
$X_1,\ldots,X_n$ are i.i.d. standard Gaussians. Apply the bounded difference
inequality to the left-hand side and use Taylor expansion to provide an
alternative proof of the Gaussian Poincaré inequality of Corollary 2.28.

Remark. The central limit theorem proof of the Gaussian Poincaré inequality
is very specific to the Gaussian distribution. While it works in this particular
case, the proof we have given above using the Ornstein-Uhlenbeck semigroup is
much more insightful and can be extended to other distributions (for example,
to log-concave distributions as in Problem 2.13 below).

2.9 (Exponential distribution). Let $\mu(dx) = 1_{x\ge 0}\,e^{-x}\,dx$ be the one-sided
exponential distribution. In this problem, we will derive two different (and
not directly comparable) Poincaré inequalities for the distribution $\mu$.

a. Show that

$$\mathrm{Var}_\mu[f] \le \mathbf{E}[\xi\,|f'(\xi)|^2], \qquad \xi \sim \mu.$$

Hint: show that $\xi \sim (X^2 + Y^2)/2$ where $X, Y$ are i.i.d. $N(0,1)$.

b. Show that

$$\mathrm{Var}_\mu[f] \le 4\,\mathbf{E}[|f'(\xi)|^2], \qquad \xi \sim \mu.$$

Hint: use $\int_0^\infty g(x)\,e^{-x}\,dx = g(0) + \int_0^\infty g'(x)\,e^{-x}\,dx$ with $g = (f - f(0))^2$.

These two distinct Poincaré inequalities correspond to two distinct Markov
processes. For the two Markov processes defined below, show that their Dirichlet
forms do indeed yield the two distinct Poincaré inequalities above:

c. The solution of the Cox-Ingersoll-Ross stochastic differential equation

$$dX_t = 2(1 - X_t)\,dt + 2\sqrt{X_t}\,dB_t,$$

which is a Markov process on $\mathbb{R}_+$ with generator

$$\mathcal{L}f(x) = 2(1-x)f'(x) + 2xf''(x).$$

d. The solution of the stochastic differential equation

$$dX_t = -\mathrm{sign}(X_t)\,dt + \sqrt{2}\,dB_t,$$

which is a Markov process on $\mathbb{R}$ with generator

$$\mathcal{L}f(x) = -\mathrm{sign}(x)f'(x) + f''(x).$$

This process has the two-sided exponential measure $\mu(dx) = \tfrac{1}{2}e^{-|x|}\,dx$ as
its stationary distribution, but the one-sided Poincaré inequality is easily
deduced from it. Alternatively, one can obtain the one-sided inequality
directly by considering the above stochastic differential equation with
reflection at 0 (i.e., a Brownian motion with negative drift reflected at 0).

Remark. In Problem 2.12 below, we will encounter yet another distinct
Poincaré inequality for the exponential distribution.
2.10 (Dependent random signs). Let $X_1,\ldots,X_n$ be random variables
with values in $\{-1,1\}$ whose joint distribution is denoted by $\mu$. In this problem,
we do not assume that $X_1,\ldots,X_n$ are independent. Thus we cannot use
tensorization. Nonetheless, we expect that if $X_1,\ldots,X_n$ are "weakly dependent"
then the concentration phenomenon should still arise. We are going to
use Theorem 2.18 to develop a precise statement along these lines.

Define the influence coefficient of variable $j$ on variable $i$ as

$$C_{ij} := \max_{x\in\{-1,1\}^{n-2}} \big|\mathbf{P}[X_i = 1\,|\,X_j = 1, \{X_k\}_{k\ne i,j} = x] - \mathbf{P}[X_i = 1\,|\,X_j = -1, \{X_k\}_{k\ne i,j} = x]\big|$$

for $i \ne j$, and let $C_{ii} = 0$. If the random variables $X_1,\ldots,X_n$ are weakly
dependent, then all the influences $C_{ij}$ should be small. The goal of this problem
is to prove the following Poincaré inequality:

$$(1 - \|C\|_{\mathrm{sp}})\,\mathrm{Var}[f(X_1,\ldots,X_n)] \le \mathbf{E}\Big[\sum_{i=1}^n \mathrm{Var}[f(X_1,\ldots,X_n)\,|\,\{X_k\}_{k\ne i}]\Big],$$

where $\|C\|_{\mathrm{sp}}$ denotes the spectral radius of the matrix $C$. If $X_1,\ldots,X_n$ are
independent, then $C = 0$ and this dependent Poincaré inequality reduces to
the tensorization inequality for independent random variables.

The basic idea is to mimic the Markov process construction that we introduced
above to prove tensorization. To this end, we attach to every coordinate
$i = 1,\ldots,n$ an independent Poisson process $N_t^i$ with unit rate. The random
process $Z_t = (Z_t^1,\ldots,Z_t^n)_{t\in\mathbb{R}_+}$ is now constructed as follows:

• Draw $Z_0 \sim \mu$ independently from the Poisson processes $N_t^1,\ldots,N_t^n$.

• Each time $N_t^i$ jumps for some $i$, replace the current value of $Z_t^i$ by an
independent sample from $\mu_i(dx_i|Z_t)$ while keeping the remaining coordinates
fixed, where $\mu_i(dx_i|x) := \mathbf{P}[X_i \in \cdot\,|\,\{X_k\}_{k\ne i} = \{x_k\}_{k\ne i}]$.

The process $Z_t$ is called a Gibbs sampler or Glauber dynamics for $\mu$.
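
To make the construction concrete, here is a small sketch of the Gibbs sampler for a toy measure on $\{-1,1\}^3$. The measure (an Ising-type chain with a hypothetical coupling `beta`) and all function names are assumptions made for this illustration. Since the $n$ independent unit-rate clocks tick at total rate $n$ and each tick belongs to a uniformly chosen coordinate, one jump of $Z_t$ amounts to picking a coordinate uniformly at random and resampling it from its conditional law.

```python
import itertools
import math
import random

n = 3
states = list(itertools.product([-1, 1], repeat=n))
beta = 0.4  # hypothetical coupling strength of a toy Ising chain

def weight(x):
    """Unnormalized weight exp(beta * sum_i x_i x_{i+1}); not a product measure."""
    return math.exp(beta * sum(x[i] * x[i + 1] for i in range(n - 1)))

Z = sum(weight(x) for x in states)
mu = {x: weight(x) / Z for x in states}

def flip_to(x, i, s):
    """Copy of x with coordinate i set to s."""
    return x[:i] + (s,) + x[i + 1:]

def conditional(i, s, x):
    """mu_i(s | x): conditional probability that X_i = s given the other coordinates of x."""
    num = mu[flip_to(x, i, s)]
    den = mu[flip_to(x, i, 1)] + mu[flip_to(x, i, -1)]
    return num / den

def gibbs_jump(x, rng):
    """One clock tick: a uniformly chosen coordinate is resampled from its conditional law."""
    i = rng.randrange(n)
    s = 1 if rng.random() < conditional(i, 1, x) else -1
    return flip_to(x, i, s)

def jump_kernel(x, y):
    """Transition probability x -> y of the embedded jump chain."""
    total = 0.0
    for i in range(n):
        if flip_to(x, i, y[i]) == y:
            total += conditional(i, y[i], x) / n
    return total

# Law after one jump when started from mu; it should coincide with mu.
mu_after = {y: sum(mu[x] * jump_kernel(x, y) for x in states) for y in states}

rng = random.Random(1)
z = states[0]
for _ in range(10):
    z = gibbs_jump(z, rng)
print("state after 10 jumps:", z)
```

The table `mu_after` can be compared with `mu` entry by entry, and the products `mu[x] * jump_kernel(x, y)` can be compared with `mu[y] * jump_kernel(y, x)`; both comparisons give a numerical preview of parts a and b below for this toy example.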

a. Show that $(Z_t)_{t\in\mathbb{R}_+}$ is Markov and that $\mu$ is stationary.

b. Show that the generator of $Z_t$ is given by

$$\mathcal{L}f = -\sum_{i=1}^n \delta_i f, \qquad \delta_i f(x) := f(x) - \int f(x)\,\mu_i(dx_i|x),$$

and that the Dirichlet form is given by

$$\mathcal{E}(f,g) = \sum_{i=1}^n \int \delta_i f\,\delta_i g\,d\mu.$$

In particular, conclude that $(Z_t)_{t\in\mathbb{R}_+}$ is reversible.

We are now going to show that the Markov semigroup is exponentially ergodic.

c. Define the local oscillation

$$\Delta_i f := \max_{x\in\{-1,1\}^n} |f(x_1,\ldots,x_{i-1},1,x_{i+1},\ldots,x_n) - f(x_1,\ldots,x_{i-1},-1,x_{i+1},\ldots,x_n)|.$$

Show that for $i \ne j$

$$\Delta_j \int f\,d\mu_i \le \Delta_j f + \Delta_i f\,C_{ij}.$$

d. Deduce from the above inequality that

$$\Delta_j(f + t\mathcal{L}f/n) \le \Delta_j f\,\Big(1 - \frac{t}{n}\Big) + \frac{t}{n}\sum_{i=1}^n \Delta_i f\,C_{ij},$$

or, in terms of the vector $\Delta f := (\Delta_1 f,\ldots,\Delta_n f)$ of local oscillations,

$$\Delta(f + t\mathcal{L}f/n) \le \Delta f\,\{I - t(I - C)/n\}.$$

e. Show using the power series identity $e^{tA} = \lim_{n\to\infty}(I + tA/n)^n$ that

$$\Delta P_tf \le \Delta f\,e^{-t(I-C)}.$$

f. Complete the proof of the Poincaré inequality (use Theorem 2.18, $5 \Rightarrow 1$).

Remark. The dependent Poincaré inequality extends readily to non-binary
random variables (i.e., not in $\{-1,1\}$), provided $C_{ij}$ are suitably redefined.


2.4  Variance  identities  and  exponential  ergodicity 

The goal of this section is to prove Theorem 2.18, which connects the Poincaré
inequality to the exponential ergodicity of a Markov semigroup. At first sight,
it is far from clear why Markov semigroups should even enter the picture: what
is the relation between $\mathrm{Var}_\mu[f]$ and $\mathcal{E}(f,f)$? In fact, the connection between
these quantities is almost trivial, as is shown in the following lemma. Once
this connection has been realized, Theorem 2.18 loses most of its mystery.

Lemma 2.30. The following identity holds:

$$\frac{d}{dt}\mathrm{Var}_\mu[P_tf] = -2\,\mathcal{E}(P_tf,P_tf).$$
Proof. Since $\mu(P_tf) = \mu(f)$,

$$\frac{d}{dt}\mathrm{Var}_\mu[P_tf] = \frac{d}{dt}\{\mu((P_tf)^2) - (\mu P_tf)^2\}$$
$$= \mu\Big(2P_tf\,\frac{d}{dt}P_tf\Big)$$
$$= \mu(2P_tf\,\mathcal{L}P_tf),$$

and the result follows from the definition of the Dirichlet form. □

Simple  as  this  result  is,  it  yields  many  important  consequences.  Let  us 
record  two  immediate  observations  for  future  reference. 

Corollary 2.31. $\mathcal{E}(f,f) \ge 0$ for every $f$.

Proof. Immediate from Lemmas 2.9 and 2.30. □

Corollary 2.32 (Integral representation of variance). Suppose that the
Markov semigroup is ergodic. Then we have for every $f$

$$\mathrm{Var}_\mu[f] = 2\int_0^\infty \mathcal{E}(P_tf,P_tf)\,dt.$$

Proof. Note that $P_tf \to \mu f$ implies $\mathrm{Var}_\mu[P_tf] \to \mathrm{Var}_\mu[\mu f] = 0$. Thus

$$\mathrm{Var}_\mu[f] = \mathrm{Var}_\mu[P_0f] - \lim_{t\to\infty}\mathrm{Var}_\mu[P_tf] = -\int_0^\infty \frac{d}{dt}\mathrm{Var}_\mu[P_tf]\,dt$$

by the fundamental theorem of calculus. Now use Lemma 2.30. □
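
As a quick worked check of Lemma 2.30 and Corollary 2.32 (added here for illustration only), consider the Ornstein-Uhlenbeck semigroup of Lemma 2.23 with the test function $f(x) = x$. Every quantity can then be computed in closed form:

```latex
% Worked example: Ornstein-Uhlenbeck semigroup, \mu = N(0,1), f(x) = x.
\begin{align*}
  P_t f(x) &= \mathbf{E}\big[e^{-t}x + \sqrt{1-e^{-2t}}\,\xi\big] = e^{-t}x, \\
  \mathrm{Var}_\mu[P_t f] &= e^{-2t}, \\
  \mathcal{E}(P_t f, P_t f) &= \|(P_t f)'\|_{L^2(\mu)}^2 = e^{-2t}, \\
  \frac{d}{dt}\mathrm{Var}_\mu[P_t f] &= -2e^{-2t} = -2\,\mathcal{E}(P_t f, P_t f), \\
  2\int_0^\infty \mathcal{E}(P_t f, P_t f)\,dt &= 2\int_0^\infty e^{-2t}\,dt = 1 = \mathrm{Var}_\mu[f].
\end{align*}
```

The fourth line is the identity of Lemma 2.30 and the last line is the integral representation of Corollary 2.32 for this particular $f$.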

Remark  2.33.  Integral  representations  of  the  variance  such  as  the  expression 
in  Corollary  2.32  can  be  very  useful  in  different  settings.  We  will  encounter 
some  alternative  integral  representations  in  the  problems  below. 

We are now ready to prove the implications $5 \Leftarrow 3 \Rightarrow 1 \Leftrightarrow 2 \Rightarrow 4$ of Theorem
2.18 that do not require reversibility. In fact, once the simple observations
made above have been realized, these implications are entirely elementary.

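Proof (Theorem 2.18, Part I). The implications $2 \Rightarrow 4$ and $3 \Rightarrow 5$ are trivial.
We proceed to consider the remaining implications.

• $3 \Rightarrow 1$: Assuming 3, we have by Corollary 2.32

$$\mathrm{Var}_\mu[f] \le 2\,\mathcal{E}(f,f)\int_0^\infty e^{-2t/c}\,dt = c\,\mathcal{E}(f,f).$$

• $1 \Rightarrow 2$: Assuming 1, we have by Lemma 2.30

$$\frac{d}{dt}\mathrm{Var}_\mu[P_tf] \le -\frac{2}{c}\,\mathrm{Var}_\mu[P_tf],$$

from which we obtain

$$\|P_tf - \mu f\|^2_{L^2(\mu)} = \mathrm{Var}_\mu[P_tf] \le e^{-2t/c}\,\mathrm{Var}_\mu[f] = e^{-2t/c}\,\|f - \mu f\|^2_{L^2(\mu)}.$$

• $2 \Rightarrow 1$: Assuming 2, we obtain using Lemma 2.30

$$2\,\mathcal{E}(f,f) = \lim_{t\downarrow 0}\frac{\mathrm{Var}_\mu[f] - \mathrm{Var}_\mu[P_tf]}{t} \ge \lim_{t\downarrow 0}\frac{1 - e^{-2t/c}}{t}\,\mathrm{Var}_\mu[f] = \frac{2}{c}\,\mathrm{Var}_\mu[f].$$

This completes the proof of the implications $5 \Leftarrow 3 \Rightarrow 1 \Leftrightarrow 2 \Rightarrow 4$.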


It remains to prove the implications $2 \Rightarrow 3$, $5 \Rightarrow 3$, and $4 \Rightarrow 2$ of Theorem
2.18. These implications require reversibility, which we have not yet exploited.
It turns out that reversibility implies a much finer property of the variance
as a function of time than was obtained in Lemma 2.30. The appropriate
property is contained in the following useful lemma.


Lemma 2.34. If the Markov semigroup $P_t$ is reversible, then the functions
$t \mapsto \log\|P_tf\|^2_{L^2(\mu)}$ and $t \mapsto \log\mathcal{E}(P_tf,P_tf)$ are convex.

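Proof. Since $\mathcal{L}$ is self-adjoint, we have

$$\frac{d}{dt}\mathcal{E}(P_tf,P_tf) = -\frac{d}{dt}\langle P_tf, \mathcal{L}P_tf\rangle_\mu$$
$$= -2\,\langle P_tf, \mathcal{L}^2P_tf\rangle_\mu$$
$$= -2\,\|\mathcal{L}P_tf\|^2_{L^2(\mu)}.$$

A straightforward computation yields

$$\frac{d^2}{dt^2}\log\|P_tf\|^2_{L^2(\mu)} = \frac{4\,\|\mathcal{L}P_tf\|^2_{L^2(\mu)}}{\|P_tf\|^2_{L^2(\mu)}} - \frac{4\,\mathcal{E}(P_tf,P_tf)^2}{\|P_tf\|^4_{L^2(\mu)}}.$$

As the right-hand side is nonnegative by the Cauchy-Schwarz inequality,
we have shown that the function $t \mapsto \log\|P_tf\|^2_{L^2(\mu)}$ is convex. The proof for
$t \mapsto \log\mathcal{E}(P_tf,P_tf)$ is entirely analogous, once we observe that the Dirichlet
form also satisfies the Cauchy-Schwarz inequality $\mathcal{E}(f,g)^2 \le \mathcal{E}(f,f)\,\mathcal{E}(g,g)$ (to
prove this, use that $\mathcal{E}(f + tg, f + tg) \ge 0$ for all $t \in \mathbb{R}$ by Corollary 2.31). □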


We  can  now  complete  the  proof  of  Theorem  2.18. 
Proof (Theorem 2.18, Part II). We first prove $2 \Rightarrow 3$. By Lemma 2.34,

$$t \mapsto \frac{d}{dt}\log\|P_tf\|^2_{L^2(\mu)} = -\frac{2\,\mathcal{E}(P_tf,P_tf)}{\|P_tf\|^2_{L^2(\mu)}}$$

is increasing. In particular, we have

$$-\frac{2\,\mathcal{E}(P_tf,P_tf)}{\|P_tf\|^2_{L^2(\mu)}} \ge -\frac{2\,\mathcal{E}(f,f)}{\|f\|^2_{L^2(\mu)}}.$$

Rearranging this inequality yields

$$\frac{\mathcal{E}(P_tf,P_tf)}{\mathcal{E}(f,f)} \le \frac{\|P_tf\|^2_{L^2(\mu)}}{\|f\|^2_{L^2(\mu)}}.$$

Thus property 3 follows readily from property 2 in Theorem 2.18 (apply the
above to $f - \mu f$ in place of $f$, which does not change the Dirichlet form).

It remains to prove $4 \Rightarrow 2$ and $5 \Rightarrow 3$. In fact, both these implications
follow immediately by applying the following lemma to the functions
$t \mapsto \log\|P_tf\|^2_{L^2(\mu)}$ and $t \mapsto \log\mathcal{E}(P_tf,P_tf)$, which are convex by Lemma 2.34. □

Lemma 2.35. If the function $g : \mathbb{R}_+ \to \mathbb{R}$ is convex and $g(t) \le K - \alpha t$ for
all $t \ge 0$, then in fact $g(t) \le g(0) - \alpha t$ for all $t \ge 0$.


Proof. It suffices to show that the assumption implies that $g'(t) \le -\alpha$ for all
$t \ge 0$. Suppose that this is not the case. Then there exists $s \ge 0$ such that
$g'(s) = -\beta > -\alpha$. As $g$ is convex, $g'$ is increasing and thus $g'(t) \ge -\beta$ for all
$t \ge s$. In particular, it follows that $g(t) \ge g(s) - \beta t$ for all $t \ge s$. As $\beta < \alpha$,
this contradicts the assumption that $g(t) \le K - \alpha t$ for all $t \ge 0$. □


Remark  2.36  (Finite  state  space  and  spectral  gaps).  While  the  elementary  im¬ 
plications  in  Theorem  2.18  are  entirely  intuitive,  the  role  of  reversibility  in 
the  remaining  implications  may  not  be  entirely  obvious:  indeed,  Lemma  2.34, 
which  containes  the  essence  of  the  reversibility  argument,  appears  as  a  bit 
of  a  miracle.  The  aim  of  this  remark  is  to  highlight  a  complementary  view¬ 
point  on  Theorem  2.18  that  sheds  additional  light  on  the  interpretation  of  the 
Poincare  inequality  and  on  the  role  of  reversibility.  While  this  viewpoint  can 
be  developed  more  generally,  we  restrict  attention  for  simplicity  to  the  setting 
of  finite  state  Markov  processes  as  in  Examples  2.12  and  2.15  above. 

Let $(X_t)_{t\in\mathbb{R}_+}$ be a Markov process in a finite state space $X_t \in \{1,\dots,d\}$. Denote by $\Lambda$ the transition rate matrix, by $\mu$ the stationary measure, and let us assume that the reversibility condition $\mu_i\Lambda_{ij} = \mu_j\Lambda_{ji}$ holds. For notational simplicity, we will implicitly identify functions and measures on $\{1,\dots,d\}$ with vectors in $\mathbb{R}^d$ in the obvious fashion. Note that we can write

$$\mathcal{E}(f,g) = -\sum_{i,j=1}^d \mu_i f_i \Lambda_{ij} g_j = \sum_{i,j=1}^d \mu_i \Lambda_{ij} f_i (g_i - g_j) = \frac{1}{2}\sum_{i,j=1}^d \mu_i\Lambda_{ij}(f_i - f_j)(g_i - g_j),$$


where we have used $\sum_j \Lambda_{ij} = 0$ in the second equality and that $\mu_i\Lambda_{ij}(g_i - g_j)$ is a skew-symmetric matrix in the third equality. In particular, we have

$$\mathcal{E}(f,f) = \frac{1}{2}\sum_{i,j=1}^d \mu_i\Lambda_{ij}(f_i - f_j)^2.$$

Again, $\mathcal{E}(f,f)$ can be naturally interpreted as an expected square gradient.

Let us now consider the Poincaré inequality from the point of view of linear algebra. As the matrix $\Lambda$ is self-adjoint with respect to the weighted inner product $\langle\cdot,\cdot\rangle_\mu$, it has real eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and associated eigenvectors $v_1,\dots,v_d$. The property $\mathcal{E}(f,f) = -\langle f,\Lambda f\rangle_\mu \ge 0$ evidently implies that $\lambda_1 \le 0$, that is, all the eigenvalues of $\Lambda$ are nonpositive. Moreover, the property $\sum_j \Lambda_{ij} = 0$ implies that $v_1 = 1$ (the vector of ones) is an eigenvector with maximal eigenvalue $\lambda_1 = 0$. If $\mu f = \langle 1,f\rangle_\mu = 0$, we have

$$\mathcal{E}(f,f) = -\langle f,\Lambda f\rangle_\mu \ge -\lambda_2\langle f,f\rangle_\mu = (\lambda_1 - \lambda_2)\,\mathrm{Var}_\mu[f],$$

and this inequality is tight for $f = v_2$. Thus the best constant in the Poincaré inequality $\mathrm{Var}_\mu[f] \le c\,\mathcal{E}(f,f)$ is $c = 1/(\lambda_1 - \lambda_2)$, the inverse of the spectral gap $\lambda_1 - \lambda_2$ of the generator $\Lambda$. For this reason, Poincaré inequalities are sometimes called spectral gap inequalities.
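
The linear algebra picture is easy to check numerically. The following is a minimal sketch, not part of the original notes: the helper names (`reversible_rate_matrix`, `dirichlet_form`, `spectral_gap`) and the randomly generated six-state chain are illustrative assumptions. It builds a rate matrix that is reversible with respect to a given measure, computes the spectral gap by symmetrizing the matrix, and compares the Dirichlet form with the gap times the variance.

```python
import numpy as np

def reversible_rate_matrix(weights, mu):
    """Rate matrix with mu_i * Lam_ij = weights_ij for i != j and zero row sums."""
    lam = weights / mu[:, None]
    np.fill_diagonal(lam, 0.0)
    np.fill_diagonal(lam, -lam.sum(axis=1))
    return lam

def dirichlet_form(lam, mu, f):
    """E(f, f) = -<f, Lam f>_mu."""
    return -np.sum(mu * f * (lam @ f))

def variance(mu, f):
    """Var_mu[f] = mu(f^2) - (mu f)^2."""
    return np.sum(mu * f**2) - np.sum(mu * f) ** 2

def spectral_gap(lam, mu):
    """Return (lambda_1 - lambda_2, v_2) for a rate matrix reversible w.r.t. mu."""
    root = np.sqrt(mu)
    # Conjugating by sqrt(mu) turns self-adjointness in <.,.>_mu into symmetry.
    sym = root[:, None] * lam / root[None, :]
    sym = (sym + sym.T) / 2                  # remove rounding asymmetry
    eigvals, eigvecs = np.linalg.eigh(sym)   # eigenvalues in ascending order
    return -eigvals[-2], eigvecs[:, -2] / root

# Hypothetical example: random symmetric weights and a random stationary measure.
rng = np.random.default_rng(0)
d = 6
mu = rng.random(d) + 0.1
mu /= mu.sum()
weights = rng.random((d, d)) + 0.1
weights = (weights + weights.T) / 2
lam = reversible_rate_matrix(weights, mu)
gap, v2 = spectral_gap(lam, mu)

f = rng.standard_normal(d)
print("E(f,f) >= gap * Var[f]:", dirichlet_form(lam, mu, f) >= gap * variance(mu, f))
print("tight at v_2:", np.isclose(dirichlet_form(lam, mu, v2), gap * variance(mu, v2)))
```

The symmetrization step is just a change of basis: self-adjointness with respect to the weighted inner product becomes ordinary symmetry, so a standard symmetric eigensolver applies.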

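We can now also understand why the Poincaré inequality is so closely related to exponential convergence of the Markov semigroup. Indeed, let $f$ be any function, and expand it in the eigenbasis of $\Lambda$ (normalized to be orthonormal with respect to $\langle\cdot,\cdot\rangle_\mu$) as

$$f = \sum_{i=1}^d a_iv_i.$$

Then the semigroup acts on $f$ as

$$P_tf = e^{t\Lambda}f = \sum_{i=1}^d e^{\lambda_it}a_iv_i.$$

As $\lambda_1 = 0$, we have

$$\sup_f \frac{\|P_tf - \mu f\|^2_{L^2(\mu)}}{\|f - \mu f\|^2_{L^2(\mu)}} = \sup_f \frac{\sum_{i=2}^d e^{2\lambda_it}a_i^2}{\sum_{i=2}^d a_i^2} = e^{-2(\lambda_1 - \lambda_2)t}.$$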

Thus the spectral gap $\lambda_1 - \lambda_2$ controls precisely the exponential convergence rate of the semigroup. The various implications of Theorem 2.18 now become rather elementary from the linear algebra viewpoint. However, the fact that these equivalences can be proved hinges from the outset on the fact that $\Lambda$ admits a spectral decomposition into eigenvectors with real-valued eigenvalues. This explains why reversibility of the semigroup (that is, the self-adjointness of $\Lambda$) is essential to obtain a complete set of equivalences in Theorem 2.18, even though this fact was not entirely explicit in our general proof given above.
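
As a small illustration of the convergence rate, the sketch below uses a continuous-time random walk on a cycle with five states. This specific chain, the time `t = 0.7`, and the helper names `semigroup` and `decay_ratio` are assumptions chosen for illustration. Its rate matrix is symmetric, so the uniform measure is reversible and the semigroup can be computed directly from the spectral decomposition.

```python
import numpy as np

d = 5
# Continuous-time random walk on a cycle: jump to each neighbour at rate 1.
# The rate matrix is symmetric, so the uniform measure is reversible.
lam = -2.0 * np.eye(d)
for i in range(d):
    lam[i, (i + 1) % d] = 1.0
    lam[i, (i - 1) % d] = 1.0
mu = np.full(d, 1.0 / d)

eigvals, eigvecs = np.linalg.eigh(lam)   # ascending order; eigvals[-1] is 0
gap = -eigvals[-2]                       # spectral gap lambda_1 - lambda_2

def semigroup(t):
    """P_t = exp(t * Lam), computed from the spectral decomposition."""
    return eigvecs @ np.diag(np.exp(t * eigvals)) @ eigvecs.T

def decay_ratio(f, t):
    """||P_t f - mu f||^2 / ||f - mu f||^2 in L^2(mu)."""
    mean = mu @ f
    ptf = semigroup(t) @ f
    return np.sum(mu * (ptf - mean) ** 2) / np.sum(mu * (f - mean) ** 2)

t = 0.7
v2 = eigvecs[:, -2]                      # an eigenvector for lambda_2
print("ratio at v_2:      ", decay_ratio(v2, t))
print("exp(-2 * gap * t): ", np.exp(-2 * gap * t))
```

For any other test function the ratio can only be smaller, since every component orthogonal to the constants decays at least as fast as the one along $v_2$.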
Problems

2.11 (Covariance identities). Let $P_t$ be a reversible ergodic Markov semigroup with stationary measure $\mu$. The goal of this problem is to prove useful integral representations of the covariance $\mathrm{Cov}_\mu(f,g) := \langle f - \mu f, g - \mu g\rangle_\mu$.

a. Prove the following identity:

$$\mathrm{Cov}_\mu(f,g) = 2\int_0^\infty \mathcal{E}(P_tf,P_tg)\,dt.$$

b. Prove the following identity:

$$\mathrm{Cov}_\mu(f,g) = \int_0^\infty \mathcal{E}(f,P_tg)\,dt.$$

c. Let $X \sim N(0,\Sigma)$ be a centered Gaussian vector in $\mathbb{R}^n$ with covariance matrix $\Sigma$. Assume that the entries are positively correlated, that is, $\Sigma_{ij} \ge 0$ for all $i,j$. Prove that this implies the following much stronger positive association property: for every pair of functions $f,g$ that are coordinatewise increasing, we have $\mathrm{Cov}(f(X),g(X)) \ge 0$.

Hint: write $X = \Sigma^{1/2}Y$ for $Y \sim N(0,I)$, and apply one of the above identities for the $n$-dimensional Ornstein-Uhlenbeck process (which is defined in precisely the same manner as the one-dimensional Ornstein-Uhlenbeck process but using an $n$-dimensional Brownian motion).

2.12 (Local Poincaré inequalities I). We have seen that the validity of a Poincaré inequality for a given distribution $\mu$ is intimately connected with exponential ergodicity of Markov processes that admit $\mu$ as the stationary measure. In this problem, we will develop a method to deduce Poincaré inequalities for the distribution of the Markov process $X_t$ at a finite time $t$, rather than for the stationary distribution (which is obtained as $t \to \infty$). In most cases, the stationary case is more useful, as it is much easier to construct a Markov process that admits a given measure $\mu$ as its stationary measure than to construct a Markov process that has distribution $\mu$ at a finite time. Nonetheless, there are several situations in which such local Poincaré inequalities are useful. In the following problem, we will see that this viewpoint provides significant insight even on the stationary case.

Let $P_t$ be a Markov semigroup with generator $\mathcal{L}$. For the purposes of this problem, we do not assume the existence of a stationary measure.

a. Prove the following variance identity:

$$P_t(f^2) - (P_tf)^2 = 2\int_0^t P_{t-s}\Gamma(P_sf,P_sf)\,ds,$$

where we recall the definition of the carré du champ (Problem 2.7)

$$\Gamma(f,g) := \tfrac{1}{2}\{\mathcal{L}(fg) - f\mathcal{L}g - g\mathcal{L}f\}.$$

Hint: apply the fundamental theorem of calculus to $P_{t-s}((P_sf)^2)$.

b. Suppose that we can prove an inequality of the form

$$\Gamma(P_sf,P_sf) \le \alpha(s)\,P_s\Gamma(f,f)$$

for some function $\alpha : \mathbb{R}_+ \to \mathbb{R}_+$. Conclude that

$$P_t(f^2) - (P_tf)^2 \le c(t)\,P_t\Gamma(f,f), \qquad c(t) := \int_0^t 2\alpha(s)\,ds.$$

Such an inequality is called a local Poincaré inequality.

c. Let $(W_t)_{t\in\mathbb{R}_+}$ be standard Brownian motion. Brownian motion is itself a Markov process. Compute an explicit expression for its semigroup and generator (in analogy with Lemma 2.23), and show that in this case

$$\Gamma(P_tf,P_tf) \le P_t\Gamma(f,f).$$

Show that the local Poincaré inequality consequently provides an alternative proof of the Gaussian Poincaré inequality using Brownian motion.

d. The present approach provides a convenient method to derive Poincaré inequalities for infinitely divisible distributions (this part requires some familiarity with Lévy processes). Let $\nu$ be a positive measure on $\mathbb{R}$ such that $\int_{\mathbb{R}}(1 \wedge |x|)\,\nu(dx) < \infty$, and let $X$ be an infinitely divisible random variable whose characteristic function has the Lévy-Khintchin representation $\mathbf{E}[e^{iuX}] = \exp\{\int(e^{iuz} - 1)\,\nu(dz)\}$. Then $X \sim X_1$, where $(X_t)_{t\in\mathbb{R}_+}$ is the Lévy process with Lévy measure $\nu$. The latter is Markov with generator

$$\mathcal{L}f(x) = \int D_yf(x)\,\nu(dy), \qquad D_yf(x) := f(x+y) - f(x).$$

Use the above machinery to prove the following Poincaré inequality:

$$\mathrm{Var}[f(X)] \le \mathbf{E}\bigg[\int (D_yf(X))^2\,\nu(dy)\bigg].$$

In particular, deduce Poincaré inequalities for the Poisson distribution and for the one-sided exponential distribution (the latter being distinct from both Poincaré inequalities in Problem 2.9 above).

2.13 (Local Poincaré inequalities II). The approach of Problem 2.12 makes it possible to obtain Poincaré inequalities using Markov processes that do not admit a stationary measure. However, even for ergodic Markov processes, it can be useful to develop a Poincaré inequality for the stationary measure $\mu$ by letting $t \to \infty$ in a local Poincaré inequality. The reason for this is the following result that will be proved in this problem.
Theorem 2.37 (Local Poincaré inequality). The following are equivalent:

1. $c\,\Gamma_2(f,f) \ge \Gamma(f,f)$ for all $f$ (Bakry-Émery criterion).

2. $\Gamma(P_tf,P_tf) \le e^{-2t/c}P_t\Gamma(f,f)$ for all $f,t$ (local ergodicity).

3. $P_t(f^2) - (P_tf)^2 \le c(1 - e^{-2t/c})P_t\Gamma(f,f)$ for all $f,t$ (local Poincaré).

Here we defined

$$\Gamma_2(f,g) := \tfrac{1}{2}\{\mathcal{L}\Gamma(f,g) - \Gamma(f,\mathcal{L}g) - \Gamma(\mathcal{L}f,g)\}.$$

This is called the iterated carré du champ or $\Gamma_2$-operator.

Why is this result useful? Suppose that $P_t$ is an ergodic Markov semigroup with stationary measure $\mu$. To prove a Poincaré inequality using Theorem 2.18, we had to be able to prove exponential ergodicity of the semigroup. This is typically a nontrivial matter: one cannot readily read off exponential ergodicity from the expression for the generator $\mathcal{L}$, for example. In contrast, the first property of Theorem 2.37 is an algebraic condition

$$c\,\Gamma_2(f,f) \ge \Gamma(f,f)$$

that can be verified readily from the expression for $\mathcal{L}$. On the other hand, if this condition is valid, letting $t \to \infty$ in property 3 of Theorem 2.37 yields

$$\mathrm{Var}_\mu[f] \le c\,\mathcal{E}(f,f)$$

(cf. Problem 2.7). Thus the local approach provides us with an algebraic criterion for the validity of a Poincaré inequality. This can be extremely useful, as we will see below. However, the Bakry-Émery criterion is strictly stronger than the validity of a Poincaré inequality for the stationary measure $\mu$.

Let us begin by proving the various implications of Theorem 2.37.

a. Prove $2 \Rightarrow 3$. Hint: this follows easily as in Problem 2.12.

b. Prove $1 \Rightarrow 2$. Hint: consider $\frac{d}{ds}P_{t-s}\Gamma(P_sf,P_sf)$.

c. Prove $3 \Rightarrow 1$. Hint: consider $\lim_{t\downarrow 0} t^{-2}\{P_t(f^2) - (P_tf)^2 - c(1 - e^{-2t/c})P_t\Gamma(f,f)\}$.

We now demonstrate the power of Theorem 2.37 in an important example.

d. Let $\mu$ be a probability measure on $\mathbb{R}^n$ with density $\mu(dx) = e^{-W(x)}\,dx$, where $W$ is a smooth convex function. Such distributions are called log-concave. Note that if $X \sim \mu$, then $X_1,\dots,X_n$ are in general not independent. Nonetheless, we have the following result: if $W$ is $\rho$-uniformly convex, that is,

$$\sum_{i,j=1}^n v_iv_j\frac{\partial^2W(x)}{\partial x_i\partial x_j} \ge \rho\|v\|^2 \quad\text{for all } v,x \in \mathbb{R}^n,$$

then we have the dimension-free Poincaré inequality

$$\mathrm{Var}_\mu[f] \le \frac{1}{\rho}\int\|\nabla f\|^2\,d\mu.$$

To prove it, we note that $\mu$ is the stationary measure of the Langevin stochastic differential equation ($B$ is $n$-dimensional Brownian motion)

$$dX_t = -\nabla W(X_t)\,dt + \sqrt{2}\,dB_t,$$

which is a Markov process with generator

$$\mathcal{L}f(x) = -\sum_{i=1}^n\frac{\partial W(x)}{\partial x_i}\frac{\partial f(x)}{\partial x_i} + \sum_{i=1}^n\frac{\partial^2f(x)}{\partial x_i^2}.$$

Prove the log-concave Poincaré inequality using the Bakry-Émery criterion.

Remark. We have shown that $\rho$-uniformly log-concave measures admit a dimension-free Poincaré inequality with constant $\rho^{-1}$. This says nothing about the general case where $\rho$ may be zero. One of the deepest open problems in the theory of Poincaré inequalities is to understand the situation for general log-concave measures. It has been conjectured by Kannan, Lovász and Simonovits that if $\mu$ is a log-concave measure on $\mathbb{R}^n$ with zero mean and identity covariance matrix, then $\mathrm{Var}_\mu[f] \le C\int\|\nabla f\|^2\,d\mu$ for a universal constant $C$ (independent of the dimension!). To date, there is little progress in this direction. However, several interesting ideas have been developed for the investigation of such problems, including a localization method that provides a partial replacement for tensorization in the setting of log-concave distributions.


Notes 

§2.1.  The  tensorization  property  of  the  variance  is  classical.  It  is  sometimes 
called  the  Efron-Stein  inequality  after  [33],  where  it  was  used  to  investigate 
Tukey’s  jackknife  estimator.  The  importance  of  tensorization  as  a  fundamental 
principle  was  emphasized  by  Ledoux  [48].  The  random  matrix  example  was 
taken  from  [13].  Problems  2.4  and  2.5  are  from  [14]  and  [48],  respectively. 
Much  of  what  is  known  on  superconcentration  can  be  found  in  [18]. 

§2.2.  The  text  [52]  gives  an  introduction  to  Markov  processes  in  continuous 
time.  A  comprehensive  treatment  of  Markov  semigroups  and  their  connections 
with  functional  inequalities  is  given  in  [7]. 

§2.3 and §2.4. The treatment of Poincaré inequalities given here follows [7], as do many of the problems. Problem 2.9 is inspired by [10], and Problem 2.10 is taken from [99]. The application of local Poincaré inequalities to infinitely divisible distributions in Problem 2.12 is inspired by [17]. See [16] for more on the conjecture of Kannan, Lovász and Simonovits for log-concave measures.
Subgaussian concentration and log-Sobolev inequalities


In Chapter 2 we investigated the simplest form of the concentration phenomenon: the variance of a function $f(X_1,\dots,X_n)$ of independent (or weakly dependent) random variables is small if the “gradient” of $f$ is small. This is indeed an embodiment of the concentration phenomenon as it was informally presented in Chapter 1: the variance measures the size of the fluctuations of the random variable $f(X_1,\dots,X_n)$, while the gradient measures the sensitivity of $f(x)$ to its coordinates $x_i$. While variance bounds can be extremely useful and are of interest in their own right, it is often important in applications to have sharper control on the distribution of the fluctuations.

What type of refined behavior can we expect? Let us recall our original motivating example where $f(X_1,\dots,X_n) = \frac{1}{n}\sum_{k=1}^n X_k$ is a linear function. By the weak law of large numbers, we expect that the fluctuations are of order

$$f(X_1,\dots,X_n) - \mathbf{E}f(X_1,\dots,X_n) \sim \sigma/\sqrt{n},$$

which is indeed captured correctly by the variance bounds developed in the previous chapter. In this case, however, the central limit theorem provides us with much sharper information: it controls not only the size of the fluctuations, but also the distribution of the fluctuations

$$f(X_1,\dots,X_n) - \mathbf{E}f(X_1,\dots,X_n) \approx N(0,\sigma^2/n).$$

In particular, we might expect that

$$\mathbf{P}[|f(X_1,\dots,X_n) - \mathbf{E}f(X_1,\dots,X_n)| \ge t] \lesssim e^{-nt^2/2\sigma^2},$$

as would be true if the fluctuations were in fact Gaussian (we will show this below). Such a Gaussian tail inequality provides much more precise control of the fluctuations than a bound on the variance. This will be important, for example, in understanding the behavior of suprema later on in the course.

As in the previous chapter, it turns out that the above idea is not restricted to linear functions, but is in fact a manifestation of a general phenomenon: it is often possible to obtain Gaussian tail bounds on the fluctuations of nonlinear functions $f$ provided that their “gradient” is small in a suitable sense. In this chapter, we begin the investigation of such concentration inequalities.


3.1  Subgaussian  variables  and  Chernoff  bounds 

Before we can prove any concentration inequalities, we must first consider how one might go about proving that a random variable satisfies a Gaussian tail bound. Most tail bounds in probability theory are proved using some form of Markov’s inequality. For example, if we have a bound on the variance as in the previous chapter, we immediately obtain a tail bound of the form

$$\mathbf{P}[|X - \mathbf{E}[X]| \ge t] \le \frac{\mathrm{Var}[X]}{t^2}.$$

However, this bound only decays as $t^{-2}$, and we cannot obtain Gaussian tail bounds from Poincaré inequalities in this manner. To obtain Gaussian tail bounds, we must use Markov’s inequality in a more sophisticated manner. The basic method is known as the Chernoff bound.

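Lemma 3.1 (Chernoff bound). Define the log-moment generating function $\psi$ of a random variable $X$ and its Legendre dual $\psi^*$ as

$$\psi(\lambda) := \log\mathbf{E}[e^{\lambda(X - \mathbf{E}X)}], \qquad \psi^*(t) = \sup_{\lambda\ge 0}\{\lambda t - \psi(\lambda)\}.$$

Then $\mathbf{P}[X - \mathbf{E}X \ge t] \le e^{-\psi^*(t)}$ for all $t \ge 0$.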

Proof. The idea is strikingly simple: we simply exponentiate inside the probability before applying Markov’s inequality. For any $\lambda \ge 0$, we have

$$\mathbf{P}[X - \mathbf{E}X \ge t] = \mathbf{P}[e^{\lambda(X - \mathbf{E}X)} \ge e^{\lambda t}] \le e^{-\lambda t}\,\mathbf{E}[e^{\lambda(X - \mathbf{E}X)}] = e^{-\{\lambda t - \psi(\lambda)\}}$$

using Markov’s inequality and that $x \mapsto e^{\lambda x}$ is increasing. As the left-hand side does not depend on the choice of $\lambda \ge 0$, we can optimize the right-hand side over $\lambda$ to obtain the statement of the lemma. □

Remark 3.2. Note that the Chernoff bound only gives the upper tail, that is, the probability $\mathbf{P}[X \ge \mathbf{E}X + t]$ that the random variable $X$ exceeds its mean $\mathbf{E}X$ by a fixed amount. However, we can obtain an inequality for the lower tail by applying the Chernoff bound to the random variable $-X$, as

$$\mathbf{P}[X \le \mathbf{E}X - t] = \mathbf{P}[-X \ge \mathbf{E}[-X] + t].$$

In particular, given an upper and lower tail bound, we can obtain a bound on the magnitude of the fluctuations using the union bound

$$\mathbf{P}[|X - \mathbf{E}X| \ge t] = \mathbf{P}[X \ge \mathbf{E}X + t \text{ or } X \le \mathbf{E}X - t] \le \mathbf{P}[X \ge \mathbf{E}X + t] + \mathbf{P}[-X \ge \mathbf{E}[-X] + t].$$


In many cases, proving an upper tail bound will immediately imply a lower tail bound and a two-sided bound in this manner. On the other hand, sometimes upper or lower tail bounds will be proved under assumptions that are not invariant under negation. For example, if we prove an upper tail bound for convex functions $f(X)$, this does not automatically imply a lower tail bound as $-f(X)$ is concave and not convex; in such cases, a lower tail bound must be proved separately. One should therefore be careful when interpreting tail bounds to check separately the validity of upper and lower tail bounds.


Remark 3.3. The utility of the Chernoff bound is by no means restricted to proving Gaussian tails as we will do below. One can obtain many different tail behaviors in this manner. However, the method clearly only works if $\psi(\lambda)$ is finite at least for $\lambda$ in a neighborhood of 0. Therefore, to apply the Chernoff bound, the random variable $X$ should have at least exponential tails. For random variables with heavier tails an alternative method is needed. For example, one could take powers rather than exponentials in Markov’s inequality:

$$\mathbf{P}[X - \mathbf{E}X \ge t] \le \inf_{p\in\mathbb{N}}\frac{\mathbf{E}[(X - \mathbf{E}X)_+^p]}{t^p}.$$

In fact, even when the Chernoff bound is applicable, it is not difficult to show that this moment bound is at least as good as the Chernoff bound.

Why are Chernoff bounds so useful? There are some simple examples, such as the case of sums of random variables, where the Chernoff bound proves to be easy to manipulate (we will exploit this in the next section). However, the real power of the Chernoff bound is that the log-moment generating function $\lambda \mapsto \psi(\lambda)$ is a continuous object, and can therefore be investigated using calculus. We will repeatedly exploit this approach in the sequel. In contrast, the moment bound given above is discrete in nature, and is therefore much more difficult to handle. As we will mostly be interested in Gaussian tail bounds, we will make full use of the convenience of the Chernoff method.
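
To get a feeling for how the two bounds compare, here is a small numerical sketch for a standard Gaussian variable. The choice of distribution, the cutoff `max_p`, and the function names are assumptions made for illustration; the closed form $\mathbf{E}[X_+^p] = 2^{p/2}\Gamma((p+1)/2)/(2\sqrt{\pi})$ for $X \sim N(0,1)$ is used for the moments, and the Chernoff bound for this variable is $e^{-t^2/2}$.

```python
import math

def log_positive_moment(p):
    """log E[(X)_+^p] for X ~ N(0, 1), using 2^(p/2) Gamma((p+1)/2) / (2 sqrt(pi))."""
    return (0.5 * p * math.log(2.0) + math.lgamma(0.5 * (p + 1))
            - math.log(2.0 * math.sqrt(math.pi)))

def moment_bound(t, max_p=200):
    """min over p in {0, ..., max_p} of E[(X)_+^p] / t^p (computed in logs)."""
    return min(math.exp(log_positive_moment(p) - p * math.log(t))
               for p in range(max_p + 1))

def chernoff_bound(t):
    """exp(-psi*(t)) with psi*(t) = t^2 / 2 for the standard Gaussian."""
    return math.exp(-0.5 * t * t)

def exact_tail(t):
    """P[X >= t] for X ~ N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

for t in (0.5, 1.0, 2.0, 3.0, 4.0):
    print(f"t={t}: exact={exact_tail(t):.3e}  "
          f"moment={moment_bound(t):.3e}  chernoff={chernoff_bound(t):.3e}")
```

The moment bound requires a separate minimization over the integer $p$ for each $t$, which is one way to see why it is less convenient than the Chernoff bound even though it is never worse.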


To  show  how  the  Chernoff  bound  can  give  rise  to  Gaussian  tail  bounds, 
let  us  first  consider  the  case  of  an  actual  Gaussian  random  variable. 

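Example 3.4. Let $X \sim N(\mu,\sigma^2)$. Then $\mathbf{E}[e^{\lambda(X - \mathbf{E}X)}] = e^{\lambda^2\sigma^2/2}$, so

$$\psi(\lambda) = \frac{\lambda^2\sigma^2}{2}, \qquad \psi^*(t) = \frac{t^2}{2\sigma^2}.$$

In particular, we have $\mathbf{P}[X - \mathbf{E}X \ge t] \le e^{-t^2/2\sigma^2}$.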
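
The Legendre dual in this example can also be approximated numerically, which is a useful sanity check when no closed form is available. The sketch below is an illustration under assumptions: the value `sigma = 1.5`, the grid over $\lambda$, and the function names are arbitrary choices. It maximizes $\lambda t - \psi(\lambda)$ over a grid and compares the resulting Chernoff bound with the exact Gaussian tail.

```python
import math
import numpy as np

sigma = 1.5  # assumed standard deviation for the illustration

def psi(lmbda):
    """Log-moment generating function of X - EX for X ~ N(mu, sigma^2)."""
    return 0.5 * lmbda**2 * sigma**2

def psi_star(t, grid=np.linspace(0.0, 20.0, 200001)):
    """Legendre dual sup_{lambda >= 0} {lambda t - psi(lambda)}, approximated on a grid."""
    return float(np.max(grid * t - psi(grid)))

def exact_upper_tail(t):
    """P[X - EX >= t] for X ~ N(mu, sigma^2)."""
    return 0.5 * math.erfc(t / (sigma * math.sqrt(2.0)))

for t in (0.5, 1.0, 3.0):
    print(f"t={t}: psi*(t)={psi_star(t):.6f}  t^2/(2 sigma^2)={t**2 / (2 * sigma**2):.6f}  "
          f"chernoff={math.exp(-psi_star(t)):.3e}  exact={exact_upper_tail(t):.3e}")
```

Because the grid maximum can only underestimate the supremum, the numerical bound $e^{-\psi^*(t)}$ obtained this way is still a valid (slightly weaker) upper bound on the tail.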


Observe that in order to get the tail bound in Example 3.4, the fact that $X$ is Gaussian was not actually important: it would suffice to assume that the log-moment generating function is bounded from above by that of a Gaussian $\psi(\lambda) \le \lambda^2\sigma^2/2$. Random variables that satisfy this condition play a central role in the investigation of Gaussian tail bounds.

Definition 3.5 (Subgaussian random variables). A random variable is called $\sigma^2$-subgaussian if its log-moment generating function satisfies $\psi(\lambda) \le \lambda^2\sigma^2/2$ for all $\lambda \in \mathbb{R}$ (and the constant $\sigma^2$ is called the variance proxy).

Note that if $\psi(\lambda)$ is the log-moment generating function of a random variable $X$, then $\psi(-\lambda)$ is the log-moment generating function of the random variable $-X$. For a $\sigma^2$-subgaussian random variable $X$, we can therefore apply the Chernoff bound to both the upper and lower tails to obtain

$$\mathbf{P}[X \ge \mathbf{E}X + t] \le e^{-t^2/2\sigma^2}, \qquad \mathbf{P}[X \le \mathbf{E}X - t] \le e^{-t^2/2\sigma^2}.$$


As moment generating functions will prove to be much easier to manipulate than the tail probabilities themselves, we will almost always study Gaussian tail behavior of random variables in terms of the subgaussian property. Fortunately, it turns out that little is lost in making this simplification: any random variable that satisfies Gaussian tail bounds must necessarily be subgaussian (albeit for a slightly larger variance proxy), cf. Problem 3.1 below.

So  far,  the  only  examples  of  subgaussian  random  variables  that  we  have 
encountered  are  Gaussians,  which  is  not  terribly  interesting.  One  of  the  most 
basic  results  on  subgaussian  random  variables  is  that  every  bounded  random 
variable  is  subgaussian.  This  statement  is  made  precise  by  Hoeffding’s  lemma, 
which  could  be  viewed  as  a  far-reaching  generalization  of  the  trivial  Lemma 
2.1.  Even  in  this  simple  setting,  the  proof  provides  a  nontrivial  illustration  of 
the  important  role  of  calculus  in  bounding  moment  generating  functions. 


Lemma 3.6 (Hoeffding lemma). Let $a \le X \le b$ a.s. for some $a,b \in \mathbb{R}$. Then $\mathbf{E}[e^{\lambda(X - \mathbf{E}X)}] \le e^{\lambda^2(b-a)^2/8}$, i.e., $X$ is $(b-a)^2/4$-subgaussian.


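Proof. We can assume without loss of generality that $\mathbf{E}X = 0$. In this case we have $\psi(\lambda) = \log\mathbf{E}[e^{\lambda X}]$, and we can readily compute

$$\psi'(\lambda) = \frac{\mathbf{E}[Xe^{\lambda X}]}{\mathbf{E}[e^{\lambda X}]}, \qquad \psi''(\lambda) = \frac{\mathbf{E}[X^2e^{\lambda X}]}{\mathbf{E}[e^{\lambda X}]} - \bigg(\frac{\mathbf{E}[Xe^{\lambda X}]}{\mathbf{E}[e^{\lambda X}]}\bigg)^2.$$

Thus $\psi''(\lambda)$ can be interpreted as the variance of the random variable $X$ under the twisted probability measure $d\mathbf{Q} = \frac{e^{\lambda X}}{\mathbf{E}[e^{\lambda X}]}\,d\mathbf{P}$. But then Lemma 2.1 yields $\psi''(\lambda) \le (b-a)^2/4$, and the fundamental theorem of calculus yields

$$\psi(\lambda) = \int_0^\lambda\int_0^\mu \psi''(\rho)\,d\rho\,d\mu \le \frac{\lambda^2(b-a)^2}{8}$$

using $\psi(0) = \log 1 = 0$ and $\psi'(0) = \mathbf{E}X = 0$.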


□ 
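
Both steps of this proof can be checked numerically for a concrete bounded variable. The sketch below is only an illustration: the two-point distribution with `a = -1`, `b = 3`, `p = 0.2`, the grid of $\lambda$ values, and the function names are assumptions. It evaluates $\psi$ and the twisted variance $\psi''$ and compares them with the bounds $\lambda^2(b-a)^2/8$ and $(b-a)^2/4$.

```python
import numpy as np

a, b, p = -1.0, 3.0, 0.2           # X = b with probability p, X = a otherwise
values = np.array([a, b])
probs = np.array([1.0 - p, p])
mean = probs @ values

def psi(lmbda):
    """log E[exp(lambda (X - EX))] for the two-point variable."""
    return np.log(probs @ np.exp(lmbda * (values - mean)))

def twisted_variance(lmbda):
    """psi''(lambda): variance of X under dQ = e^{lambda X} / E[e^{lambda X}] dP."""
    q = probs * np.exp(lmbda * values)
    q /= q.sum()
    return q @ values**2 - (q @ values) ** 2

lambdas = np.linspace(-5.0, 5.0, 1001)
psi_values = np.array([psi(l) for l in lambdas])
hoeffding = lambdas**2 * (b - a) ** 2 / 8
max_twisted_var = max(twisted_variance(l) for l in lambdas)

print("psi <= lambda^2 (b-a)^2 / 8 on the grid:", bool(np.all(psi_values <= hoeffding + 1e-12)))
print("max twisted variance:", max_twisted_var, " bound (b-a)^2/4:", (b - a) ** 2 / 4)
```

The twisted variance depends on $\lambda$ because tilting moves mass between the two atoms; the worst case over $\lambda$ is what the trivial variance bound of Lemma 2.1 controls.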



Problems 

3.1 (Subgaussian variables). There are several different notions of random variables with a Gaussian tail that are all essentially equivalent up to constants. The aim of this problem is to obtain some insight into these notions.

a. As a warmup exercise, show that if $X$ is $\sigma^2$-subgaussian, then $\mathrm{Var}[X] \le \sigma^2$.

b. Show that for any increasing and differentiable function $\Phi$

$$\mathbf{E}[\Phi(|X|)] = \Phi(0) + \int_0^\infty \Phi'(t)\,\mathbf{P}[|X| \ge t]\,dt.$$

This elementary identity will be needed below.

In the following, we will assume for simplicity that $\mathbf{E}X = 0$. We now prove that the following three properties are equivalent for suitable constants $\sigma, b, c$: (1) $X$ is $\sigma^2$-subgaussian; (2) $\mathbf{P}[|X| \ge t] \le 2e^{-bt^2}$; and (3) $\mathbf{E}[e^{cX^2}] \le 2$.

c. Show that if $X$ is $\sigma^2$-subgaussian, then $\mathbf{P}[|X| \ge t] \le 2e^{-t^2/2\sigma^2}$.

d. Show that if $\mathbf{P}[|X| \ge t] \le 2e^{-t^2/2\sigma^2}$, then $\mathbf{E}[e^{X^2/6\sigma^2}] \le 2$.

Hint: use part b.

e. Show that if $\mathbf{E}[e^{X^2/6\sigma^2}] \le 2$, then $X$ is $18\sigma^2$-subgaussian.

Hint: use $\mathbf{E}[e^{\lambda X}] \le 1 + \frac{\lambda^2}{2}\mathbf{E}[X^2e^{|\lambda X|}]$ by Taylor’s theorem together with Young’s inequality $|\lambda X| \le \frac{a\lambda^2}{2} + \frac{X^2}{2a}$ for a suitable choice of $a$.

In addition, the subgaussian property of $X$ is equivalent to the fact that the moments of $X$ scale as is the case for the Gaussian distribution.

f. Show that if $X$ is $\sigma^2$-subgaussian, then $\mathbf{E}[X^{2q}] \le (4\sigma^2)^q q!$ for all $q \in \mathbb{N}$.

Hint: use part b.

g. Show that if $\mathbf{E}[X^{2q}] \le (4\sigma^2)^q q!$ for all $q \in \mathbb{N}$, then $\mathbf{E}[e^{X^2/8\sigma^2}] \le 2$.

Hint: expand in a power series.

3.2 (Tightness of Hoeffding’s lemma). Show that the bound of Hoeffding’s lemma is the best possible by considering $\mathbf{P}[X = a] = \mathbf{P}[X = b] = \frac{1}{2}$.

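3.3 (Chernoff bound vs. moments). Show that for $t \ge 0$

$$\mathbf{P}[X - \mathbf{E}X \ge t] \le \inf_{p\ge 0}\frac{\mathbf{E}[(X - \mathbf{E}X)_+^p]}{t^p} \le \inf_{\lambda\ge 0} e^{-\lambda t}\,\mathbf{E}[e^{\lambda(X - \mathbf{E}X)}].$$

Thus the moment bound of Remark 3.3 is at least as good as the Chernoff bound. However, the former is much harder to use than the latter.

Hint: use $\mathbf{E}[e^{\lambda(X - \mathbf{E}X)}] \ge \mathbf{E}[1_{X - \mathbf{E}X > 0}\,e^{\lambda(X - \mathbf{E}X)}]$ and expand in a power series.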



3.4  (Chernoff  bound  exercises).  Compute  the  explicit  form  of  the  Chernoff 
bound  for  Poisson  and  Bernoulli  random  variables. 


3.5 (Maxima of subgaussian variables). Let $X_1,X_2,\dots$ be (not necessarily independent) $\sigma^2$-subgaussian random variables. Show that

$$\mathbf{P}\Big[\max_{i\le n}\{X_i - \mathbf{E}X_i\} \ge (1+\varepsilon)\sigma\sqrt{2\log n}\Big] \xrightarrow{n\to\infty} 0 \quad\text{for all } \varepsilon > 0.$$

Hint: use the union bound

$$\mathbf{P}[X \vee Y \ge t] = \mathbf{P}[X \ge t \text{ or } Y \ge t] \le \mathbf{P}[X \ge t] + \mathbf{P}[Y \ge t].$$

This problem shows that the maximum $\max_{i\le n}\{X_i - \mathbf{E}X_i\}$ of $\sigma^2$-subgaussian random variables is at most of order $\sigma\sqrt{2\log n}$. This is the simplest example of the crucial role played by tail bounds in estimating the size of maxima of random variables. The second part of this course will be entirely devoted to the investigation of such problems (using much deeper ideas).
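
The order $\sigma\sqrt{2\log n}$ is easy to observe in simulation. The following sketch is only an experiment, not a proof: it assumes independent centered Gaussians with `sigma = 1`, a fixed random seed, and 20 repetitions per sample size, all of which are arbitrary choices. It records the average maximum divided by $\sigma\sqrt{2\log n}$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
trials = 20
ratios = {}
for n in (10**2, 10**3, 10**4, 10**5):
    # Each row holds n independent N(0, sigma^2) samples; take the row-wise maximum.
    maxima = rng.normal(0.0, sigma, size=(trials, n)).max(axis=1)
    scale = sigma * np.sqrt(2.0 * np.log(n))
    ratios[n] = float(maxima.mean() / scale)
    print(f"n={n}: mean max / (sigma sqrt(2 log n)) = {ratios[n]:.3f}")
```

In this independent Gaussian case the ratio stays below one and approaches it only slowly, which is consistent with the upper bound of the problem; for dependent variables the maximum can be much smaller.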


3.2  The  martingale  method 

Let $X_1,\dots,X_n$ be independent random variables. In the previous chapter, we showed that the variance of $f(X_1,\dots,X_n)$ can be bounded in many cases by a “square gradient” of the function $f$. The aim of this chapter is to obtain a much stronger type of result: we would like to show that $f(X_1,\dots,X_n)$ is subgaussian with variance proxy controlled by a “square gradient” of $f$.

A  key  idea  developed  in  the  previous  chapter  was  to  use  tensorization 
to  reduce  the  problem  to  the  one-dimensional  case.  With  the  tensorization 
inequality  in  hand,  we  could  even  apply  a  trivial  bound  such  as  Lemma  2.1 
to  obtain  a  nontrivial  variance  inequality  in  terms  of  bounded  differences. 
Our  first  instinct  in  the  present  setting  is  therefore  to  prove  a  tensorization 
inequality  for  the  subgaussian  property,  which  could  then  be  combined  with 
Hoeffding’s  Lemma  3.6  (which  plays  the  analogous  role  in  the  present  setting 
to  the  trivial  Lemma  2.1  for  the  variance)  in  order  to  obtain  a  concentration 
inequality  in  terms  of  bounded  differences.  Unfortunately,  it  turns  out  that 
unlike  in  the  case  of  the  variance,  the  subgaussian  property  does  not  tensorize 
in  a  natural  manner,  and  thus  we  cannot  directly  implement  this  program.  One 
of  the  most  important  ideas  that  will  be  developed  in  the  following  sections 
is  that  the  proof  of  subgaussian  inequalities  can  be  reduced  to  a  strengthened 
form  of  Poincare  inequalities,  called  log-Sobolev  inequalities,  that  do  tensorize 
exactly  in  the  same  manner  as  the  variance.  This  will  provide  us  with  a  very 
powerful  tool  to  prove  subgaussian  concentration. 

There is, however, a more elementary approach that should be attempted before we begin introducing new ideas. Even though the subgaussian property does not tensorize in the same manner as the variance, we can still repeat some of the steps in the proof of the tensorization Theorem 2.3 in the subgaussian setting. Recall that the main idea of the proof of Theorem 2.3 is to write

$$f(X_1,\dots,X_n) - \mathbf{E}f(X_1,\dots,X_n) = \sum_{k=1}^n \Delta_k,$$

where

$$\Delta_k = \mathbf{E}[f(X_1,\dots,X_n)|X_1,\dots,X_k] - \mathbf{E}[f(X_1,\dots,X_n)|X_1,\dots,X_{k-1}]$$

are martingale differences. The following simple result, which exploits the nice behavior of the exponential of a sum, could be viewed as a sort of poor man’s tensorization property for sums of martingale increments. By working directly with the martingale increments, we will be able to derive a first concentration inequality. This approach is commonly known as the martingale method.

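Lemma 3.7 (Azuma). Let $\{\mathcal{F}_k\}_{k\le n}$ be any filtration, and let $\Delta_1,\dots,\Delta_n$ be random variables that satisfy the following properties for $k = 1,\dots,n$:

1. Martingale difference property: $\Delta_k$ is $\mathcal{F}_k$-measurable and $\mathbf{E}[\Delta_k|\mathcal{F}_{k-1}] = 0$.

2. Conditional subgaussian property: $\mathbf{E}[e^{\lambda\Delta_k}|\mathcal{F}_{k-1}] \le e^{\lambda^2\sigma_k^2/2}$ a.s.

Then the sum $\sum_{k=1}^n \Delta_k$ is subgaussian with variance proxy $\sum_{k=1}^n \sigma_k^2$.

Proof. For any $1 \le k \le n$, we can compute

$$\mathbf{E}\big[e^{\lambda\sum_{i=1}^k\Delta_i}\big] = \mathbf{E}\big[e^{\lambda\sum_{i=1}^{k-1}\Delta_i}\,\mathbf{E}[e^{\lambda\Delta_k}|\mathcal{F}_{k-1}]\big] \le e^{\lambda^2\sigma_k^2/2}\,\mathbf{E}\big[e^{\lambda\sum_{i=1}^{k-1}\Delta_i}\big].$$

It follows by induction that $\mathbf{E}[e^{\lambda\sum_{i=1}^n\Delta_i}] \le e^{\lambda^2\sum_{k=1}^n\sigma_k^2/2}$. □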

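Remark 3.8. While we did not explicitly use the martingale difference property in the proof, $\mathbf{E}[e^{\lambda\Delta_k}|\mathcal{F}_{k-1}] \le e^{\lambda^2\sigma_k^2/2}$ can in fact only hold if $\mathbf{E}[\Delta_k|\mathcal{F}_{k-1}] = 0$ (consider $(\mathbf{E}[e^{\lambda\Delta_k}|\mathcal{F}_{k-1}] - 1)/\lambda$ as $\lambda \downarrow 0$ and as $\lambda \uparrow 0$). In general, the conditional subgaussian property of $X$ given $\mathcal{F}$ should read $\mathbf{E}[e^{\lambda(X - \mathbf{E}[X|\mathcal{F}])}|\mathcal{F}] \le e^{\lambda^2\sigma^2/2}$ a.s.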

In  combination  with  Hoeffding’s  Lemma  3.6,  we  now  obtain  a  classical 
result  on  the  tail  behavior  of  sums  of  martingale  differences. 

Corollary 3.9 (Azuma-Hoeffding inequality). Let $\{\mathcal{F}_k\}_{k\le n}$ be any filtration, and let $\Delta_k, A_k, B_k$ satisfy the following properties for $k = 1,\dots,n$:

1. Martingale difference property: $\Delta_k$ is $\mathcal{F}_k$-measurable and $\mathbf{E}[\Delta_k|\mathcal{F}_{k-1}] = 0$.

2. Predictable bounds: $A_k, B_k$ are $\mathcal{F}_{k-1}$-measurable and $A_k \le \Delta_k \le B_k$ a.s.

Then $\sum_{k=1}^n \Delta_k$ is subgaussian with variance proxy $\frac{1}{4}\sum_{k=1}^n \|B_k - A_k\|_\infty^2$. In particular, we obtain for every $t \ge 0$ the tail bound

$$\mathbf{P}\bigg[\sum_{k=1}^n \Delta_k \ge t\bigg] \le \exp\bigg(-\frac{2t^2}{\sum_{k=1}^n\|B_k - A_k\|_\infty^2}\bigg).$$

Proof. Applying Hoeffding’s Lemma 3.6 to $\Delta_k$ conditionally on $\mathcal{F}_{k-1}$ implies $\mathbf{E}[e^{\lambda\Delta_k}|\mathcal{F}_{k-1}] \le e^{\lambda^2(B_k - A_k)^2/8}$. The result now follows from Lemma 3.7. □

Example 3.10. The Azuma-Hoeffding inequality is often applied in the following
setting. Let $X_1,\ldots,X_n$ be independent random variables such that
$a \le X_i \le b$ for all $i$. Applying Corollary 3.9 with $\Delta_k = (X_k - \mathbf{E}X_k)/n$ yields

$$\mathbf{P}\Bigl[\frac{1}{n}\sum_{k=1}^n \{X_k - \mathbf{E}X_k\} \ge t\Bigr] \le e^{-2nt^2/(b-a)^2}.$$


By the central limit theorem, this bound is of the correct order both in terms
of the size of the sum and its Gaussian tail behavior. However, just as for
the case of the variance (see the discussion in section 2.1), this bound can be
pessimistic in that it does not capture any information on the distribution of
the variables $X_i$: in particular, the variance proxy $(b-a)^2/4$ may be much
larger than the actual variance of the random variables $X_i$. Much of the effort
in developing concentration inequalities is to obtain bounds in terms of “good”
variance proxies for the purposes of the application at hand.
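
To make the last point concrete, the following sketch compares the bound of Example 3.10 with an exact tail probability. It assumes, purely for illustration, that the $X_k$ are i.i.d. Bernoulli($p$) variables (so $a = 0$, $b = 1$); the function names are our own and do not come from the notes.

```python
import math

def exact_upper_tail(n, p, t):
    """Exact P[(1/n) * sum_k (X_k - p) >= t] for i.i.d. Bernoulli(p) variables X_k."""
    threshold = n * (p + t)
    total = 0.0
    for s in range(n + 1):
        # Small tolerance so that rounding in n * (p + t) cannot drop a value
        # of s that sits exactly on the threshold.
        if s >= threshold - 1e-9:
            total += math.comb(n, s) * p**s * (1 - p) ** (n - s)
    return total

def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Right-hand side of Example 3.10: exp(-2 n t^2 / (b - a)^2)."""
    return math.exp(-2 * n * t**2 / (b - a) ** 2)

if __name__ == "__main__":
    n, t = 200, 0.1
    for p in (0.5, 0.05):
        print(f"p={p}: exact tail {exact_upper_tail(n, p, t):.3e}, "
              f"Hoeffding bound {hoeffding_bound(n, t):.3e}")
```

The bound does not depend on $p$, whereas the exact tail shrinks sharply when $p$ is small and the true variance $p(1-p)$ is far below the proxy $(b-a)^2/4$.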


We motivated the development of tail bounds for martingale differences as
a partial replacement of the tensorization inequality for the variance. Let us
therefore return to the case of functions $f(X_1,\ldots,X_n)$ of independent random
variables $X_1,\ldots,X_n$. Using the Azuma-Hoeffding inequality, we readily
obtain our first and simplest subgaussian concentration inequality. Recall that

$$D_i f(x) := \sup_z f(x_1,\ldots,x_{i-1},z,x_{i+1},\ldots,x_n) - \inf_z f(x_1,\ldots,x_{i-1},z,x_{i+1},\ldots,x_n)$$

are the discrete derivatives defined in section 2.1.

Theorem 3.11 (McDiarmid). For $X_1,\ldots,X_n$ independent, $f(X_1,\ldots,X_n)$
is subgaussian with variance proxy $\frac{1}{4}\sum_{k=1}^n \|D_k f\|_\infty^2$. In particular,

$$\mathbf{P}[f(X_1,\ldots,X_n) - \mathbf{E}f(X_1,\ldots,X_n) \ge t] \le e^{-2t^2/\sum_{k=1}^n \|D_k f\|_\infty^2}.$$

Proof. As in the proof of the tensorization Theorem 2.3, we write

$$f(X_1,\ldots,X_n) - \mathbf{E}f(X_1,\ldots,X_n) = \sum_{k=1}^n \Delta_k,$$

where

$$\Delta_k = \mathbf{E}[f(X_1,\ldots,X_n)|X_1,\ldots,X_k] - \mathbf{E}[f(X_1,\ldots,X_n)|X_1,\ldots,X_{k-1}]$$

are martingale differences. Note that $A_k \le \Delta_k \le B_k$ with

$$A_k = \mathbf{E}\bigl[\inf_z f(X_1,\ldots,X_{k-1},z,X_{k+1},\ldots,X_n) - f(X_1,\ldots,X_n)\,\big|\,X_1,\ldots,X_{k-1}\bigr],$$
$$B_k = \mathbf{E}\bigl[\sup_z f(X_1,\ldots,X_{k-1},z,X_{k+1},\ldots,X_n) - f(X_1,\ldots,X_n)\,\big|\,X_1,\ldots,X_{k-1}\bigr],$$

where we have used the independence of $X_k$ and $X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n$.
The result now follows immediately from the Azuma-Hoeffding inequality of
Corollary 3.9 once we note that $|B_k - A_k| \le \|D_k f\|_\infty$. □
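
As a small sanity check of Theorem 3.11, one can compute the sup norms $\|D_k f\|_\infty$ and the exact upper tail by brute-force enumeration on a tiny product space. The sketch below assumes i.i.d. coordinates that are uniform on a finite set, and takes $f$ to be the number of distinct values among the coordinates; both choices, and all function names, are illustrative assumptions rather than part of the notes.

```python
import itertools
import math

def distinct_count(x):
    """f(x) = number of distinct values among the coordinates of x."""
    return len(set(x))

def discrete_derivative_sup_norms(f, n, values):
    """Brute-force ||D_k f||_inf for k = 0..n-1 on the product space values^n."""
    norms = [0] * n
    for x in itertools.product(values, repeat=n):
        for k in range(n):
            outcomes = [f(x[:k] + (z,) + x[k + 1:]) for z in values]
            norms[k] = max(norms[k], max(outcomes) - min(outcomes))
    return norms

def exact_upper_tail(f, n, values, t):
    """Exact P[f(X) - E f(X) >= t] for X uniform on values^n."""
    points = list(itertools.product(values, repeat=n))
    mean = sum(f(x) for x in points) / len(points)
    return sum(1 for x in points if f(x) - mean >= t) / len(points)

def mcdiarmid_bound(norms, t):
    """Right-hand side of Theorem 3.11: exp(-2 t^2 / sum_k ||D_k f||_inf^2)."""
    return math.exp(-2 * t**2 / sum(c**2 for c in norms))

if __name__ == "__main__":
    n, values = 5, range(4)
    norms = discrete_derivative_sup_norms(distinct_count, n, values)
    for t in (0.5, 0.9):
        print(t, exact_upper_tail(distinct_count, n, values, t), mcdiarmid_bound(norms, t))
```

In such a low dimension the bound is necessarily loose; the enumeration only illustrates how the quantities in the theorem are computed.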

McDiarmid’s  inequality  should  be  viewed  as  a  subgaussian  form  of  the 
bounded  difference  inequality  of  Corollary  2.4.  In  Corollary  2.4,  the  variance 
is controlled by the expectation of the “square gradient” of the function $f$. In
contrast,  McDiarmid’s  inequality  yields  the  stronger  subgaussian  property,  but 
here  the  variance  proxy  is  controlled  by  a  uniform  upper  bound  on  the  “square 
gradient”  rather  than  its  expectation.  Of  course,  it  makes  sense  that  a  stronger 
property  requires  a  stronger  assumption.  We  will  repeatedly  encounter  this 
idea  in  the  setting  of  concentration  inequalities:  typically  the  expectation  of 
the  “square  gradient”  controls  the  variance,  while  a  uniform  bound  on  the 
“square  gradient”  controls  the  subgaussian  variance  proxy. 

However, from this viewpoint, the result of Theorem 3.11 is not satisfactory:
as the appropriate notion of “square gradient” in the bounded difference
inequality is $\sum_{k=1}^n |D_k f|^2$, we would expect a variance proxy of order
$\|\sum_{k=1}^n |D_k f|^2\|_\infty$; however, Theorem 3.11 only yields control in terms of the
larger quantity $\sum_{k=1}^n \|D_k f\|_\infty^2$. The former would constitute a crucial
improvement over the latter in many situations (for example, in the setting of
the random matrix Example 2.5). Unfortunately, the martingale method is far
too crude to capture this idea. In the sequel, we will develop new techniques
for proving subgaussian concentration inequalities that will make it possible
to prove much more refined bounds in many settings.

Problems 

3.6 (Bin packing). For the bin packing Problem 2.3, show that the variance
bound $\mathrm{Var}[B_n] \le n/4$ can be strengthened to a Gaussian tail bound

$$\mathbf{P}[|B_n - \mathbf{E}B_n| \ge t] \le 2e^{-2t^2/n}.$$

In view of Problem 2.3, this bound has the correct order.

3.7 (Rademacher processes). Let $\varepsilon_1,\ldots,\varepsilon_n$ be independent symmetric
Bernoulli random variables $\mathbf{P}[\varepsilon_i = \pm 1] = \frac{1}{2}$, and let $T \subseteq \mathbb{R}^n$. Define

$$Z = \sup_{t\in T} \sum_{k=1}^n \varepsilon_k t_k.$$

In Problem 2.2, we showed that

$$\mathrm{Var}[Z] \le 4 \sup_{t\in T} \sum_{k=1}^n t_k^2.$$

Show that McDiarmid's inequality can give, at best, a bound of the form

$$\mathbf{P}[|Z - \mathbf{E}Z| \ge t] \le 2e^{-t^2/2\sigma^2} \quad\text{with}\quad \sigma^2 = \sum_{k=1}^n \sup_{t\in T} t_k^2.$$

Show by means of an example that the variance proxy in McDiarmid's inequality
can exhibit a vastly incorrect scaling as a function of dimension $n$.

3.8 (Empirical frequencies). Let $X_1,\ldots,X_n$ be i.i.d. random variables
with any distribution $\mu$ on a measurable space $E$, and let $\mathcal{C}$ be a countable
class of measurable subsets of $E$. By the law of large numbers,

$$\frac{\#\{k \in \{1,\ldots,n\} : X_k \in C\}}{n} \approx \mu(C)$$

when $n$ is large. In order to analyze empirical risk minimization methods in
machine learning, it is important to control the deviation between the true
probability $\mu(C)$ and its empirical average uniformly over the class $\mathcal{C}$. In
particular, one would like to guarantee that the uniform deviation

$$Z_n = \sup_{C\in\mathcal{C}} \Bigl|\frac{\#\{k \in \{1,\ldots,n\} : X_k \in C\}}{n} - \mu(C)\Bigr|$$

does not exceed a certain level with high probability. As a starting point
towards proving such a result, show that for every $n \ge 1$ and $t \ge 0$

$$\mathbf{P}[Z_n \ge \mathbf{E}Z_n + t] \le e^{-2nt^2}.$$

To obtain a bound on $\mathbf{P}[Z_n \ge t]$, it therefore remains to control $\mathbf{E}Z_n$ (the
techniques for this will be developed in the second part of the course).

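3.9 (Sums in Hilbert space). Let $X_1,\ldots,X_n$ be independent random variables
with zero mean in a Hilbert space, and suppose that $\|X_k\| \le C$ a.s. for
every $k$. Let us prove a sort of Hilbert-valued analogue of Example 3.10.

a. Show that for all $t \ge 0$

$$\mathbf{P}\Bigl[\Bigl\|\frac{1}{n}\sum_{k=1}^n X_k\Bigr\| \ge \mathbf{E}\Bigl\|\frac{1}{n}\sum_{k=1}^n X_k\Bigr\| + t\Bigr] \le e^{-nt^2/2C^2}.$$

b. Show that

$$\mathbf{E}\Bigl\|\frac{1}{n}\sum_{k=1}^n X_k\Bigr\| \le Cn^{-1/2}.$$

c. Conclude that for all $t \ge Cn^{-1/2}$

$$\mathbf{P}\Bigl[\Bigl\|\frac{1}{n}\sum_{k=1}^n X_k\Bigr\| \ge t\Bigr] \le e^{-nt^2/8C^2}.$$

d. Finally, argue that for all $t \ge 0$

$$\mathbf{P}\Bigl[\Bigl\|\frac{1}{n}\sum_{k=1}^n X_k\Bigr\| \ge t\Bigr] \le 2e^{-nt^2/8C^2}.$$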

3.10 (Random graphs). An Erdős-Rényi random graph $G(n,p)$ is a graph
on $n$ vertices such that for every pair of vertices $v, v'$ there is an edge between
them with probability $p$, independently of the other edges. A coloring of the
graph is the assignment of a color to each vertex such that every pair of
vertices connected by an edge have a distinct color. The chromatic number $\chi$
is the minimal number of colors needed to color the graph. Show that

$$\mathbf{P}[|\chi - \mathbf{E}\chi| \ge t\sqrt{n}] \le 2e^{-t^2}.$$

It can be shown that the chromatic number satisfies $\mathbf{E}\chi \sim n/(2\log_b n)$ as
$n \to \infty$, where $b = 1/(1-p)$. We therefore see that the fluctuations of the
chromatic number are of much smaller order than its magnitude.

3.11 (A generalization of Azuma-Hoeffding). Consider the same setting
as in Corollary 3.9. The Azuma-Hoeffding inequality provides a Gaussian tail
bound in the case that $|B_k - A_k|$ is uniformly bounded, but this may not always
hold in practice. Prove the following general form of the Azuma-Hoeffding
inequality that does not require boundedness of the increments:

$$\mathbf{P}\Bigl[\sum_{k=1}^n \Delta_k \ge t \text{ and } \sum_{k=1}^n (B_k - A_k)^2 \le c^2\Bigr] \le e^{-2t^2/c^2}.$$

Hint: consider $\lambda\sum_{k=1}^n \Delta_k - \frac{\lambda^2}{8}\sum_{k=1}^n (B_k - A_k)^2$.

3.3 The entropy method

The martingale method developed in the previous section has many useful
applications. Nonetheless, as was explained above, the inequalities derived
from this approach are often unsatisfactory in high dimension. In essence,
the fundamental problem is that the subgaussian property does not tensorize
naturally, and the martingale method can only partially address this issue. In
order to obtain sharper results, we must confront the tensorization problem
directly. In this section, we will introduce a powerful method to do just that.
The key idea is to introduce an alternative formulation of the subgaussian
property that behaves naturally under tensorization.

Recall that a random variable $X$ is subgaussian if its log-moment generating
function satisfies $\psi(\lambda) := \log\mathbf{E}[e^{\lambda(X - \mathbf{E}X)}] \lesssim \lambda^2$. We have already seen the
importance of using calculus to bound moment generating functions in the
proof of Hoeffding's Lemma 3.6: the idea used there is that if $\psi''(\lambda) \lesssim 1$,
then the subgaussian property is obtained by integrating twice. The idea behind
the following result is very similar: as the subgaussian property is equivalent
to $\lambda^{-1}\psi(\lambda) \lesssim \lambda$, it suffices to show that $\frac{d}{d\lambda}\{\lambda^{-1}\psi(\lambda)\} \lesssim 1$.

Definition  3.12.  The  entropy  of  a  nonnegative  random  variable  Z  is 

Ent  [Z]  :=  E[Zlog  Z]  -  E[Z]logE[Z]. 

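Lemma 3.13 (Herbst). Suppose that

$$\mathrm{Ent}[e^{\lambda X}] \le \frac{\lambda^2\sigma^2}{2}\,\mathbf{E}[e^{\lambda X}] \quad\text{for all } \lambda \ge 0.$$

Then

$$\psi(\lambda) := \log\mathbf{E}[e^{\lambda(X - \mathbf{E}X)}] \le \frac{\lambda^2\sigma^2}{2} \quad\text{for all } \lambda \ge 0.$$

Proof. As $\psi(\lambda) = \log\mathbf{E}[e^{\lambda X}] - \lambda\mathbf{E}X$, we have

$$\frac{d}{d\lambda}\frac{\psi(\lambda)}{\lambda} = \frac{1}{\lambda}\frac{\mathbf{E}[Xe^{\lambda X}]}{\mathbf{E}[e^{\lambda X}]} - \frac{1}{\lambda^2}\log\mathbf{E}[e^{\lambda X}] = \frac{1}{\lambda^2}\frac{\mathrm{Ent}[e^{\lambda X}]}{\mathbf{E}[e^{\lambda X}]}.$$

Thus the assumption of the lemma yields

$$\frac{\psi(\lambda)}{\lambda} = \int_0^\lambda \frac{1}{u^2}\frac{\mathrm{Ent}[e^{uX}]}{\mathbf{E}[e^{uX}]}\,du \le \frac{\lambda\sigma^2}{2}$$

using the fundamental theorem of calculus and $\lim_{\lambda\downarrow 0}\lambda^{-1}\psi(\lambda) = 0$. □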

As an immediate consequence, we see that if a random variable $X$ satisfies

$$\mathrm{Ent}[e^{\lambda X}] \le \frac{\lambda^2\sigma^2}{2}\,\mathbf{E}[e^{\lambda X}] \quad\text{for all } \lambda \in \mathbb{R},$$

then $X$ is $\sigma^2$-subgaussian. Thus we have a sufficient condition for the
subgaussian property in terms of entropy. In fact, up to a constant factor, the
converse is also true: if $X$ is $\frac{\sigma^2}{4}$-subgaussian, then the assumption of Lemma 3.13 holds
(Problem 3.12). We may therefore view the above entropy inequality as an
alternative formulation of the subgaussian property of a random variable $X$.
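
The entropy condition and its consequence can be checked numerically for a simple discrete variable. The sketch below assumes a symmetric Bernoulli (Rademacher) variable $X$ with $\sigma^2 = 1$, for which $\mathbf{E}[e^{\lambda X}] = \cosh\lambda$; the helper names are our own.

```python
import math

def entropy(values, probs):
    """Ent[Z] = E[Z log Z] - E[Z] log E[Z] for a positive discrete random variable Z."""
    mean = sum(p * z for z, p in zip(values, probs))
    return sum(p * z * math.log(z) for z, p in zip(values, probs)) - mean * math.log(mean)

def herbst_quantities(xs, probs, sigma2, lam):
    """Return (Ent[e^{lam X}], (lam^2 sigma^2 / 2) E[e^{lam X}], psi(lam), lam^2 sigma^2 / 2)."""
    exp_terms = [math.exp(lam * x) for x in xs]
    mgf = sum(p * e for e, p in zip(exp_terms, probs))
    mean_x = sum(p * x for x, p in zip(xs, probs))
    psi = math.log(mgf) - lam * mean_x  # log-moment generating function of X - EX
    return entropy(exp_terms, probs), lam**2 * sigma2 / 2 * mgf, psi, lam**2 * sigma2 / 2

if __name__ == "__main__":
    for lam in (0.1, 1.0, 3.0):
        ent, ent_bound, psi, psi_bound = herbst_quantities((-1, 1), (0.5, 0.5), 1.0, lam)
        print(f"lam={lam}: Ent={ent:.4f} <= {ent_bound:.4f}, psi={psi:.4f} <= {psi_bound:.4f}")
```

In this example both the hypothesis of Lemma 3.13 and its conclusion hold for every $\lambda$ tried, in agreement with the fact that a Rademacher variable is 1-subgaussian.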

It  may  not  be  immediately  evident  what  we  have  accomplished.  Indeed,  we 
have  obtained  yet  another  formulation  of  the  subgaussian  property,  which  may 
appear  at  first  sight  no  more  useful  than  any  other  (and  perhaps  somewhat 
less  intuitive  than  most).  However,  the  formulation  in  terms  of  entropy  proves 
to be a very powerful idea. For example, we will presently show that entropy
obeys an exact analogue of the tensorization property of the variance, from
which  its  utility  in  high  dimension  will  be  immediately  obvious.  In  fact,  it 
turns  out  that  entropy  behaves  in  many  ways  like  the  variance.  Once  we  are 
comfortable  with  this  idea,  it  will  become  evident  that  several  other  notions 
from  Chapter  2  extend  naturally  to  the  subgaussian  setting. 

To formulate the tensorization inequality, let $X_1,\ldots,X_n$ be independent
random variables. For each function $f(x_1,\ldots,x_n)$, we define the function

$$\mathrm{Ent}_i f(x_1,\ldots,x_n) := \mathrm{Ent}[f(x_1,\ldots,x_{i-1},X_i,x_{i+1},\ldots,x_n)].$$

That is, $\mathrm{Ent}_i f(x)$ is the entropy of $f(X_1,\ldots,X_n)$ with respect to the variable
$X_i$ only, the remaining variables being kept fixed.

Theorem 3.14 (Tensorization of entropy). We have

$$\mathrm{Ent}[f(X_1,\ldots,X_n)] \le \mathbf{E}\Bigl[\sum_{i=1}^n \mathrm{Ent}_i f(X_1,\ldots,X_n)\Bigr]$$

whenever $X_1,\ldots,X_n$ are independent.

To prove Theorem 3.14 we will need a fundamental result that can be
viewed as an analogue of Hölder's inequality for entropy.

Lemma 3.15 (Variational formula for entropy). We have

$$\mathrm{Ent}[Z] = \sup\{\mathbf{E}[ZX] : X \text{ is a random variable satisfying } \mathbf{E}[e^X] = 1\}.$$

Proof. Let $\mathbf{E}[e^X] = 1$ and define the new probability $d\mathbf{Q} = e^X d\mathbf{P}$. Then

$$\mathrm{Ent}[Z] - \mathbf{E}[ZX] = \mathbf{E}[Z\log Z] - \mathbf{E}[Z\log e^X] - \mathbf{E}[Z]\log\mathbf{E}[Z]$$
$$= \mathbf{E}_{\mathbf{Q}}[e^{-X}Z\log(e^{-X}Z)] - \mathbf{E}_{\mathbf{Q}}[e^{-X}Z]\log\mathbf{E}_{\mathbf{Q}}[e^{-X}Z].$$

As $x \mapsto x\log x$ is convex, it follows from Jensen's inequality that
$\mathrm{Ent}[Z] - \mathbf{E}[ZX] \ge 0$ for every random variable $X$ such that $\mathbf{E}[e^X] = 1$. But note that
$\mathrm{Ent}[Z] - \mathbf{E}[ZX] = 0$ for $X = \log(Z/\mathbf{E}[Z])$, and thus the proof is complete. □
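
The variational formula is easy to test on a finite probability space. In the sketch below (our own helper names, with an arbitrary three-point distribution chosen for illustration), any candidate $X$ is first shifted by a constant so that $\mathbf{E}[e^X] = 1$, and the gap $\mathrm{Ent}[Z] - \mathbf{E}[ZX]$ is then computed.

```python
import math

def entropy(values, probs):
    """Ent[Z] = E[Z log Z] - E[Z] log E[Z] for a positive discrete random variable Z."""
    mean = sum(p * z for z, p in zip(values, probs))
    return sum(p * z * math.log(z) for z, p in zip(values, probs)) - mean * math.log(mean)

def variational_gap(z, x, probs):
    """Ent[Z] - E[Z X'], where X' = X - log E[e^X] is normalized so that E[e^{X'}] = 1."""
    log_norm = math.log(sum(p * math.exp(v) for v, p in zip(x, probs)))
    x_normalized = [v - log_norm for v in x]
    return entropy(z, probs) - sum(p * zi * xi for zi, xi, p in zip(z, x_normalized, probs))

def optimal_x(z, probs):
    """The maximizer X = log(Z / E[Z]) identified in the proof of Lemma 3.15."""
    mean = sum(p * zi for zi, p in zip(z, probs))
    return [math.log(zi / mean) for zi in z]

if __name__ == "__main__":
    z, probs = [0.5, 2.0, 3.0], [0.2, 0.5, 0.3]
    print(variational_gap(z, [1.0, -1.0, 0.3], probs))   # nonnegative
    print(variational_gap(z, optimal_x(z, probs), probs))  # zero up to rounding
```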

We can now complete the proof of Theorem 3.14.

Proof (Theorem 3.14). Let $Z = f(X_1,\ldots,X_n)$, and define for $k = 1,\ldots,n$

$$U_k = \log\mathbf{E}[Z|X_1,\ldots,X_k] - \log\mathbf{E}[Z|X_1,\ldots,X_{k-1}].$$

Then evidently

$$\mathrm{Ent}[Z] = \mathbf{E}[Z(\log Z - \log\mathbf{E}[Z])] = \sum_{k=1}^n \mathbf{E}[ZU_k].$$

On the other hand, note that

$$\mathbf{E}[e^{U_k}|X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n] = \frac{\mathbf{E}[\mathbf{E}[Z|X_1,\ldots,X_k]\,|\,X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n]}{\mathbf{E}[Z|X_1,\ldots,X_{k-1}]}$$
$$= \frac{\mathbf{E}[\mathbf{E}[Z|X_1,\ldots,X_k]\,|\,X_1,\ldots,X_{k-1}]}{\mathbf{E}[Z|X_1,\ldots,X_{k-1}]} = 1,$$

where we have used that $X_{k+1},\ldots,X_n$ and $X_1,\ldots,X_k$ are independent.
Therefore, applying Lemma 3.15 conditionally yields

$$\mathbf{E}[ZU_k|X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n] \le \mathrm{Ent}[Z|X_1,\ldots,X_{k-1},X_{k+1},\ldots,X_n] = \mathrm{Ent}_k f(X_1,\ldots,X_n),$$

where $\mathrm{Ent}[Z|\mathcal{G}] := \mathbf{E}[Z\log Z|\mathcal{G}] - \mathbf{E}[Z|\mathcal{G}]\log\mathbf{E}[Z|\mathcal{G}]$. In particular,

$$\mathbf{E}[ZU_k] \le \mathbf{E}[\mathrm{Ent}_k f(X_1,\ldots,X_n)]$$

by the tower property, and the proof is complete. □
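
Theorem 3.14 can be verified by direct enumeration on a small product space. The following sketch assumes finitely supported independent coordinates, described by dictionaries mapping each value to its probability; this representation and the function names are illustrative assumptions.

```python
import itertools
import math

def entropy(values, probs):
    """Ent[Z] = E[Z log Z] - E[Z] log E[Z] for a positive discrete random variable Z."""
    mean = sum(p * z for z, p in zip(values, probs))
    return sum(p * z * math.log(z) for z, p in zip(values, probs)) - mean * math.log(mean)

def tensorization_sides(f, marginals):
    """Return (Ent[f(X)], E[sum_i Ent_i f(X)]) for independent X_i with the given marginals.

    marginals is a list of dicts {value: probability}, one per coordinate.
    """
    supports = [list(m.items()) for m in marginals]
    points = []
    for combo in itertools.product(*supports):
        x = tuple(v for v, _ in combo)
        weight = math.prod(p for _, p in combo)
        points.append((x, weight))
    lhs = entropy([f(x) for x, _ in points], [w for _, w in points])
    rhs = 0.0
    for x, weight in points:
        for i, m in enumerate(marginals):
            # Entropy in coordinate i only, the other coordinates being kept fixed.
            vals = [f(x[:i] + (z,) + x[i + 1:]) for z in m]
            rhs += weight * entropy(vals, list(m.values()))
    return lhs, rhs

if __name__ == "__main__":
    marginals = [{0: 0.3, 1: 0.7}, {0: 0.5, 1: 0.2, 2: 0.3}, {0: 0.6, 1: 0.4}]
    print(tensorization_sides(lambda x: 1 + x[0] * x[1] + (x[1] - x[2]) ** 2, marginals))
    print(tensorization_sides(lambda x: (1 + x[0]) * (2 + x[1]) * (3 + x[2]), marginals))
```

For a product function $f(x) = \prod_i g_i(x_i)$ the two sides coincide, which the second call illustrates; for a general positive $f$ the inequality is typically strict.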

The entropic formulation of the subgaussian property and the tensorization
inequality for entropy immediately indicate what type of inequalities
we should prove to obtain subgaussian concentration inequalities. Informally,
suppose we can prove an inequality in one dimension of the form

“ $\mathrm{entropy}(e^g) \lesssim \mathbf{E}[\,|\mathrm{gradient}(g)|^2 e^g\,]$. ”

Then we obtain for product measures in any dimension, by tensorization,

“ $\mathrm{entropy}(e^{\lambda f}) \lesssim \mathbf{E}[\,\|\mathrm{gradient}(\lambda f)\|^2 e^{\lambda f}\,]$, ”

and thus $f$ is subgaussian with variance proxy of order $\|\,\|\mathrm{gradient}(f)\|^2\,\|_\infty$.
This is precisely the subgaussian counterpart of the Poincaré inequalities

“ $\mathrm{variance}(f) \lesssim \mathbf{E}[\,\|\mathrm{gradient}(f)\|^2\,]$ ”

that were obtained in Chapter 2. The entropy inequalities informally described
above are one form of a class of inequalities called log-Sobolev inequalities. In
the next section, we will develop a general framework for understanding and
proving log-Sobolev inequalities that is similar to (but less powerful than) the
theory developed in Chapter 2 for Poincaré inequalities.

As  a  first  illustration  of  the  entropy  method,  let  us  prove  a  log-Sobolev 
counterpart  of  the  trivial  variance  inequality  of  Lemma  2.1. 

Lemma 3.16 (Discrete log-Sobolev). Let $D^- f := f - \inf f$. Then

$$\mathrm{Ent}[e^f] \le \mathrm{Cov}[f, e^f] \le \mathbf{E}[|D^- f|^2 e^f].$$



Remark 3.17. The constant in this inequality is not optimal. Improved constants
will be derived in Problem 3.13 below. The suboptimal result is given
here as its simple proof seems the most intuitive and insightful.

Proof. Note that $\log\mathbf{E}[e^f] \ge \mathbf{E}[f]$ by Jensen's inequality. Therefore

$$\mathrm{Ent}[e^f] = \mathbf{E}[fe^f] - \mathbf{E}[e^f]\log\mathbf{E}[e^f] \le \mathbf{E}[fe^f] - \mathbf{E}[f]\mathbf{E}[e^f] = \mathrm{Cov}[f, e^f].$$

To prove the second part, note that

$$\mathrm{Cov}[f, e^f] = \mathbf{E}[(f - \inf f)(e^f - \mathbf{E}[e^f])] \le \mathbf{E}[(f - \inf f)(e^f - e^{\inf f})].$$

Since $e^x$ is convex, the first-order condition for convexity implies
$e^f - e^{\inf f} \le e^f(f - \inf f)$. Substituting into the above expression completes the proof. □
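
Both inequalities of Lemma 3.16 can be checked directly for a function with finitely many values. The sketch below uses an arbitrary four-point distribution chosen for illustration; the function name is our own.

```python
import math

def discrete_log_sobolev_chain(fs, probs):
    """Return (Ent[e^f], Cov[f, e^f], E[|D^- f|^2 e^f]) for a finite-valued f.

    fs lists the values of f and probs their probabilities; D^- f = f - inf f.
    """
    exp_f = [math.exp(v) for v in fs]
    mean_f = sum(p * v for v, p in zip(fs, probs))
    mean_exp = sum(p * e for e, p in zip(exp_f, probs))
    mean_f_exp = sum(p * v * e for v, e, p in zip(fs, exp_f, probs))
    ent = mean_f_exp - mean_exp * math.log(mean_exp)
    cov = mean_f_exp - mean_f * mean_exp
    inf_f = min(fs)
    grad = sum(p * (v - inf_f) ** 2 * e for v, e, p in zip(fs, exp_f, probs))
    return ent, cov, grad

if __name__ == "__main__":
    ent, cov, grad = discrete_log_sobolev_chain([0.0, 1.0, -2.0, 0.5], [0.1, 0.4, 0.2, 0.3])
    print(f"Ent = {ent:.4f} <= Cov = {cov:.4f} <= E[|D^- f|^2 e^f] = {grad:.4f}")
```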

We can now obtain Gaussian tail bounds in terms of one-sided differences

$$D_i^- f(x) := f(x_1,\ldots,x_n) - \inf_z f(x_1,\ldots,x_{i-1},z,x_{i+1},\ldots,x_n),$$
$$D_i^+ f(x) := \sup_z f(x_1,\ldots,x_{i-1},z,x_{i+1},\ldots,x_n) - f(x_1,\ldots,x_n)$$

by combining the discrete log-Sobolev inequality with tensorization of entropy.

Theorem 3.18 (Bounded difference inequality). For all $t \ge 0$

$$\mathbf{P}[f(X_1,\ldots,X_n) \ge \mathbf{E}f(X_1,\ldots,X_n) + t] \le e^{-t^2/4\|\sum_{i=1}^n |D_i^- f|^2\|_\infty},$$
$$\mathbf{P}[f(X_1,\ldots,X_n) \le \mathbf{E}f(X_1,\ldots,X_n) - t] \le e^{-t^2/4\|\sum_{i=1}^n |D_i^+ f|^2\|_\infty}$$

whenever $X_1,\ldots,X_n$ are independent. In particular, the random variable
$f(X_1,\ldots,X_n)$ is subgaussian with variance proxy $2\|\sum_{i=1}^n |D_i f|^2\|_\infty$.

Proof. By Lemma 3.16, we have

$$\mathrm{Ent}_i e^f \le \mathbf{E}[|D_i^- f|^2 e^f\,|\,X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n].$$

Thus we have for $\lambda \ge 0$

$$\mathrm{Ent}[e^{\lambda f}] \le \lambda^2\,\mathbf{E}\Bigl[\sum_{i=1}^n |D_i^- f|^2 e^{\lambda f}\Bigr] \le \lambda^2\,\Bigl\|\sum_{i=1}^n |D_i^- f|^2\Bigr\|_\infty \mathbf{E}[e^{\lambda f}]$$

using the tensorization Theorem 3.14, where we used that $D_i^-(\lambda f) = \lambda D_i^- f$
for $\lambda \ge 0$. Thus Lemma 3.13 and the Chernoff bound yield the upper tail
bound. The lower tail bound is obtained by applying the upper tail bound to
$-f$ and noting that $D_i^-(-f) = D_i^+ f$. As $|D_i^- f| \le |D_i f|$ and $|D_i^+ f| \le |D_i f|$,
the subgaussian property is deduced identically from Lemma 3.13. □

The bounds of Theorem 3.18 are a vast improvement over McDiarmid's
inequality of Theorem 3.11: here the variance proxy is a genuine upper bound
on the square gradient $\|\sum_{i=1}^n |D_i f|^2\|_\infty$, while in McDiarmid's inequality the
gradient must be bounded coordinatewise $\sum_{i=1}^n \|D_i f\|_\infty^2$. We also obtain finer
bounds in terms of one-sided differences, which is important in many applications.
What enables these improved bounds is that the log-Sobolev inequality
tensorizes much more efficiently than the subgaussian property itself. Indeed,
note how we have kept the gradient inside the expectation throughout the
tensorization process, and only took its uniform norm at the end to obtain a
subgaussian inequality; had we directly tensorized the subgaussian bound of
Lemma 3.13, we would only be able to recover McDiarmid's inequality.

On  the  other  hand,  unlike  in  the  previous  bounds  we  have  encountered, 
we  see  here  an  important  case  where  the  upper  and  lower  tail  bounds  are  not 
symmetric: the upper tail bound is given in terms of the negative gradient
$D_i^- f$, while the lower tail bound is given in terms of the positive gradient $D_i^+ f$.
There  are  applications  where  only  one  of  these  quantities  can  be  controlled. 
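
To see how large the gap between the two variance proxies can be, consider as an illustrative assumption the function $f(x) = \max_k x_k$ on $\{-1,1\}^n$, that is, the Rademacher process of Problem 3.7 with $T$ the set of standard basis vectors. The brute-force sketch below (our own function names) computes the McDiarmid proxy $\frac{1}{4}\sum_k \|D_k f\|_\infty^2$ and the upper-tail proxy $2\|\sum_k |D_k^- f|^2\|_\infty$ of Theorem 3.18.

```python
import itertools

def variance_proxies(f, n, values=(-1, 1)):
    """Return (McDiarmid proxy, upper-tail proxy of Theorem 3.18) by enumeration.

    McDiarmid (Theorem 3.11):       (1/4) * sum_k ||D_k f||_inf^2
    Upper tail (Theorem 3.18):      2 * || sum_k |D_k^- f|^2 ||_inf
    """
    sup_two_sided = [0] * n      # running value of ||D_k f||_inf for each k
    sup_one_sided_sum = 0.0      # running value of sup_x sum_k |D_k^- f(x)|^2
    for x in itertools.product(values, repeat=n):
        fx = f(x)
        one_sided_sum = 0.0
        for k in range(n):
            outcomes = [f(x[:k] + (z,) + x[k + 1:]) for z in values]
            sup_two_sided[k] = max(sup_two_sided[k], max(outcomes) - min(outcomes))
            one_sided_sum += (fx - min(outcomes)) ** 2
        sup_one_sided_sum = max(sup_one_sided_sum, one_sided_sum)
    return sum(c**2 for c in sup_two_sided) / 4, 2 * sup_one_sided_sum

if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, variance_proxies(max, n))
```

In this example the McDiarmid proxy equals $n$, while the upper-tail proxy stays equal to 8 for every $n$: only a configuration with a single coordinate equal to $+1$ has a nonzero one-sided difference, and only in that coordinate.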

Example 3.19 (Random matrices). We recall the setting of Example 2.5. Let
$M$ be an $n \times n$ symmetric matrix where $\{M_{ij} : i \ge j\}$ are i.i.d. symmetric
Bernoulli random variables $\mathbf{P}[M_{ij} = \pm 1] = \frac{1}{2}$. We denote by $\lambda_{\max}(M)$ the
largest eigenvalue of $M$, and by $v_{\max}(M)$ a corresponding eigenvector.

It was shown in Example 2.5 that

$$D_{ij}^-\lambda_{\max}(M) \le 4|v_{\max}(M)_i||v_{\max}(M)_j|.$$

Thus we can estimate

$$\sum_{i\ge j} |D_{ij}^-\lambda_{\max}(M)|^2 \le 16\Bigl(\sum_{i=1}^n |v_{\max}(M)_i|^2\Bigr)^2 = 16,$$

and we therefore obtain by Theorem 3.18 the upper tail bound

$$\mathbf{P}[\lambda_{\max}(M) - \mathbf{E}\lambda_{\max}(M) \ge t] \le e^{-t^2/64}$$

for all $t \ge 0$. This is a much sharper control of the fluctuations above the
mean in comparison to the variance bound of Example 2.5.

On the other hand, we cannot use Theorem 3.18 to control the fluctuations
below the mean. Indeed, for the positive gradient, we can compute

$$D_{ij}^+\lambda_{\max}(M) \le 4|v_{\max}(M^{(ij)})_i||v_{\max}(M^{(ij)})_j|$$

as in Example 2.5, where $M^{(ij)}$ is the matrix such that $M^{(ij)}_{ij} = M^{(ij)}_{ji}$ is
chosen to maximize $\lambda_{\max}(M^{(ij)})$ while the remaining entries are kept fixed. Now
there is no reason to expect that $\sum_{i\ge j} |v_{\max}(M^{(ij)})_i|^2|v_{\max}(M^{(ij)})_j|^2$ is bounded uniformly
in the dimension (as a different matrix $M^{(ij)}$ is chosen for every entry $ij$), and
thus we cannot obtain a dimension-free lower tail bound in this manner.

It does not seem to be possible to prove a subgaussian lower tail bound
in terms of $D_i^- f$ (or, equivalently, an upper tail bound in terms of $D_i^+ f$).
It is instructive to attempt to repeat the proof of the discrete log-Sobolev
inequality of Lemma 3.16 in terms of the positive gradient: this gives at best

$$\mathrm{Ent}[e^f] \le \mathbf{E}[|D^+ f|^2]\,\mathbf{E}[e^f],$$

which does not behave well under tensorization. Thus the situation is inherently
asymmetric. However, in many examples where the negative gradient
$D_i^- f$ can be controlled, it turns out that in fact a stronger property holds as
well that makes it possible to obtain both upper and lower tail bounds using a
result known as Talagrand's concentration inequality. The machinery needed
to prove such bounds will be discussed in the next chapter.

Problems 

3.12 (Subgaussian variables and entropy). Lemma 3.13 states that if

$$\mathrm{Ent}[e^{\lambda X}] \le \frac{\lambda^2\sigma^2}{2}\,\mathbf{E}[e^{\lambda X}] \quad\text{for all } \lambda \in \mathbb{R},$$

then the random variable $X$ is $\sigma^2$-subgaussian. Prove the following converse
implication: if $X$ is $\frac{\sigma^2}{4}$-subgaussian, then the above entropy inequality holds.
We may therefore view this property as yet another equivalent formulation of
the subgaussian property in the spirit of Problem 3.1.

Hint: Note that $\mathrm{Ent}[e^{\lambda X}]/\mathbf{E}[e^{\lambda X}] = \mathbf{E}[Z\log Z]$ for $Z = e^{\lambda X}/\mathbf{E}[e^{\lambda X}]$. Now use
concavity of the logarithm and that $\mathbf{E}[e^{\lambda(X - \mathbf{E}X)}] \ge 1$ (why?).

3.13 (Optimal discrete log-Sobolev constants). The discrete log-Sobolev
inequality of Lemma 3.16 yields a bounded difference inequality with variance
proxy $2\|\sum_{i=1}^n |D_i^- f|^2\|_\infty$. The constant is not optimal: in view of the bounded
difference inequality for the variance (Corollary 2.4), we would expect a variance
proxy $\|\sum_{i=1}^n |D_i^- f|^2\|_\infty$ without the additional factor 2. Moreover, in
terms of the two-sided difference, we would expect $\frac{1}{4}\|\sum_{i=1}^n |D_i f|^2\|_\infty$, which
gains an additional factor $\frac{1}{4}$. It turns out that a more refined proof of the
discrete log-Sobolev inequality can attain these improved numerical constants.

One place where we lose in the proof of Lemma 3.16 is in estimating the
entropy by a covariance. Instead, we can use a variational principle for the
entropy to obtain an improved upper bound. Of course, Lemma 3.15 is useless
for this purpose as it can only yield lower bounds on the entropy.

a. Prove the following variational principle:

$$\mathrm{Ent}[Z] = \inf_{t>0} \mathbf{E}[Z\log Z - Z\log t - Z + t].$$

b. Use the above variational principle to show that

$$\mathrm{Ent}[e^f] \le \mathbf{E}[\varphi(D^- f)e^f], \qquad \varphi(x) := e^{-x} + x - 1.$$

c. Show that $\varphi(x) \le \frac{x^2}{2}$ for $x \ge 0$, and use it to improve Lemma 3.16 to

$$\mathrm{Ent}[e^f] \le \frac{1}{2}\,\mathbf{E}[|D^- f|^2 e^f].$$

d. We now consider the two-sided gradient $Df = \sup f - \inf f$. Use the bound
$\psi''(\lambda) \le (Df)^2/4$ on the log-moment generating function from the proof of
Lemma 3.6 and reason as in the proof of Lemma 3.13 to show that

$$\mathrm{Ent}[e^f] \le \frac{1}{8}\,\mathbf{E}[|Df|^2 e^f].$$

Hint: express $\mathrm{Ent}[e^{\lambda f}]/\mathbf{E}[e^{\lambda f}]$ in terms of $\psi(\lambda)$ and its derivative and apply
the fundamental theorem of calculus.

3.14 (Rademacher processes). Let $\varepsilon_1,\ldots,\varepsilon_n$ be independent symmetric
Bernoulli random variables $\mathbf{P}[\varepsilon_i = \pm 1] = \frac{1}{2}$, and let $T \subseteq \mathbb{R}^n$. Define

$$Z = \sup_{t\in T} \sum_{k=1}^n \varepsilon_k t_k.$$

Show that for $t \ge 0$

$$\mathbf{P}[Z - \mathbf{E}Z \ge t] \le e^{-t^2/4\sigma^2} \quad\text{with}\quad \sigma^2 = 4\sup_{t\in T} \sum_{k=1}^n t_k^2.$$

This is a crucial improvement over the result obtained in Problem 3.7 using
McDiarmid's inequality. However, here we only obtain an upper tail bound:
Talagrand's inequality is needed to obtain a matching lower tail.

3.15 (Convex log-Sobolev). Show that for a convex function $f : [a,b] \to \mathbb{R}$,

$$\mathrm{Ent}[e^f] \le (b-a)^2\,\mathbf{E}[|f'|^2 e^f],$$

where $f'$ is the calculus (not discrete) derivative. Conclude that if $f : \mathbb{R}^n \to \mathbb{R}$
is convex and $L$-Lipschitz, i.e., $|f(x) - f(y)| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^n$, and
if $X_1,\ldots,X_n$ are independent with values in $[a,b]$, then for every $t \ge 0$

$$\mathbf{P}[f(X_1,\ldots,X_n) - \mathbf{E}f(X_1,\ldots,X_n) \ge t] \le e^{-t^2/4(b-a)^2L^2}.$$

Note that this does not yield a lower tail bound: if $f$ is convex, $-f$ is concave.
Hint: Recall Problem 2.5.

3.16 (Exponential Poincaré inequalities). In this problem, we will assume
the validity of a general kind of log-Sobolev inequality of the form

$$\mathrm{Ent}[e^{\lambda f}] \le \frac{\lambda^2}{2}\,\mathbf{E}[\Gamma(f)e^{\lambda f}]$$

for $\lambda \ge 0$, where $\Gamma(f)$ is some suitable notion of “$\|\mathrm{gradient}(f)\|^2$.” Such an
inequality can be used to prove that $f$ is $\|\Gamma(f)\|_\infty$-subgaussian using Lemma
3.13. In this problem, we will show that it is in fact possible to obtain more
precise control on the moment generating function of $f$. In fact, we will prove

$$\mathbf{E}[e^{f - \mathbf{E}f}] \le \mathbf{E}[e^{\Gamma(f)}],$$

which could be viewed as an “exponential Poincaré inequality.”

a. Show that

$$\mathrm{Ent}[e^{\lambda f}] \ge \lambda^2\,\mathbf{E}[\Gamma(f)e^{\lambda f}] - \mathbf{E}[e^{\lambda f}]\log\mathbf{E}[e^{\lambda^2\Gamma(f)}].$$

Hint: use the variational formula for entropy.

b. Use the log-Sobolev inequality to show that

$$\mathrm{Ent}[e^{\lambda f}] \le \lambda^2 q(\lambda^2)\,\mathbf{E}[e^{\lambda f}], \qquad q(s) = \log\|e^{\Gamma(f)}\|_s.$$

c. Prove the exponential Poincaré inequality $\mathbf{E}[e^{f - \mathbf{E}f}] \le \mathbf{E}[e^{\Gamma(f)}]$.

3.4  Log-Sobolev  inequalities 

In the previous section, we have seen that one can prove dimension-free
subgaussian concentration inequalities by establishing log-Sobolev inequalities.
We proved a simple discrete log-Sobolev inequality using elementary methods,
and used it to obtain subgaussian analogues of the bounded difference
inequalities for the variance of section 2.1. As in the case of the variance,
however, we would like to develop machinery to prove log-Sobolev inequalities in
different settings and with respect to different notions of gradient.

In this section, we will develop a partial log-Sobolev analogue of the powerful
Markov process machinery developed in section 2.3 to prove Poincaré
inequalities: we will show that the validity of a log-Sobolev inequality for a
measure $\mu$ is intimately connected to exponential convergence of a Markov
semigroup to its stationary measure $\mu$ in the sense of entropy (rather than
in $L^2(\mu)$, which would only yield a Poincaré inequality as in section 2.3).
To be precise, we will prove an entropic analogue of the “easy” implications
$3 \Rightarrow 1 \Leftrightarrow 2$ of Theorem 2.18 whose proofs do not require reversibility. It is
not too surprising that we cannot reproduce the remaining implications in
the entropic setting: exploiting reversibility essentially requires the structure
of $L^2(\mu)$, while entropy (unlike the variance) is not an $L^2(\mu)$ notion (in the
context of Remark 2.36, note that the entropy is not naturally expressed in
terms of the spectrum of the generator). As a consequence, our log-Sobolev
analogue of Theorem 2.18 is significantly less powerful than its Poincaré
counterpart. Nonetheless, we will see that this approach remains extremely useful,
particularly in the setting of continuous distributions.

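In the sequel, we define $\mathrm{Ent}_\mu[f] := \mu(f\log f) - \mu f\log\mu f$.

Theorem 3.20 (Log-Sobolev inequality). Let $P_t$ be a Markov semigroup
with stationary measure $\mu$. The following are equivalent:

1. $\mathrm{Ent}_\mu[f] \le c\,\mathcal{E}(\log f, f)$ for all $f$ (log-Sobolev inequality).

2. $\mathrm{Ent}_\mu[P_t f] \le e^{-t/c}\,\mathrm{Ent}_\mu[f]$ for all $f, t$ (entropic exponential ergodicity).

Moreover, if $\mathrm{Ent}_\mu[P_t f] \to 0$ as $t \to \infty$ (entropic ergodicity), then

3. $\mathcal{E}(\log P_t f, P_t f) \le e^{-t/c}\,\mathcal{E}(\log f, f)$ for all $f, t$

implies 1 and 2 above.

Proof. An elementary computation yields

$$\frac{d}{dt}\mathrm{Ent}_\mu[P_t f] = \mu(\mathcal{L}P_t f\,\log P_t f) + \mu(\mathcal{L}P_t f) = -\mathcal{E}(\log P_t f, P_t f),$$

where we have used that $\mu(\mathcal{L}P_t f) = \frac{d}{dt}\mu(P_t f) = \frac{d}{dt}\mu f = 0$. We now prove:

• $3 \Rightarrow 1$: By the fundamental theorem of calculus, 3 implies

$$\mathrm{Ent}_\mu[f] = \int_0^\infty \mathcal{E}(\log P_t f, P_t f)\,dt \le \mathcal{E}(\log f, f)\int_0^\infty e^{-t/c}\,dt = c\,\mathcal{E}(\log f, f).$$

• $1 \Rightarrow 2$: Assuming 1, we obtain 2 directly from

$$\frac{d}{dt}\mathrm{Ent}_\mu[P_t f] = -\mathcal{E}(\log P_t f, P_t f) \le -\frac{1}{c}\,\mathrm{Ent}_\mu[P_t f].$$

• $2 \Rightarrow 1$: Assuming 2, we can compute

$$\mathcal{E}(\log f, f) = \lim_{t\downarrow 0}\frac{\mathrm{Ent}_\mu[f] - \mathrm{Ent}_\mu[P_t f]}{t} \ge \lim_{t\downarrow 0}\frac{1 - e^{-t/c}}{t}\,\mathrm{Ent}_\mu[f] = \frac{1}{c}\,\mathrm{Ent}_\mu[f].$$

This completes the proof. □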

As in section 2.3, it may not be obvious at first sight why the inequality
$\mathrm{Ent}_\mu[f] \le c\,\mathcal{E}(\log f, f)$ should be viewed as a log-Sobolev inequality in the
sense that we introduced in the previous section. Once we consider some
illuminating examples, it should become clear that this is indeed the case.

Example 3.21 (Discrete log-Sobolev inequality). Let $\mu$ be any probability
measure, and define a Markov process $X_t$ as follows:

• Draw $X_0 \sim \mu$.

• Let $N_t$ be a Poisson process with rate 1, independent of $X_0$. Each time $N_t$
jumps, replace the current value of $X_t$ by an independent sample from $\mu$.

This is nothing other than the case $n = 1$ of the ergodic Markov process
defined in Example 2.29. In particular, it is easily seen that $\mu$ is the stationary
measure of $X_t$, and that its semigroup and Dirichlet form are given by

$$P_t f = e^{-t}f + (1 - e^{-t})\mu f, \qquad \mathcal{E}(f,g) = \mathrm{Cov}_\mu[f,g].$$

Now note that, by the convexity of $x \mapsto x\log x$,

$$P_t f\log P_t f \le e^{-t}f\log f + (1 - e^{-t})\,\mu f\log\mu f.$$

Thus we have

$$\mathrm{Ent}_\mu[P_t f] = \mu(P_t f\log P_t f) - \mu f\log\mu f \le e^{-t}\,\mathrm{Ent}_\mu[f],$$

and we conclude by implication $2 \Rightarrow 1$ of Theorem 3.20 that

$$\mathrm{Ent}_\mu[f] \le \mathcal{E}(\log f, f) = \mathrm{Cov}_\mu[\log f, f].$$

Replacing $f$ by $e^g$, we see that we have obtained the discrete log-Sobolev
inequality of Lemma 3.16 as a special case of Theorem 3.20.
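
The entropy decay in Example 3.21 can be observed numerically on a finite state space. The sketch below assumes a measure $\mu$ with four atoms and an arbitrary positive function $f$, both chosen only for illustration; the helper names are our own.

```python
import math

def ent_mu(f, mu):
    """Ent_mu[f] = mu(f log f) - mu(f) log mu(f) for a positive function f on a finite space."""
    mean = sum(m * v for v, m in zip(f, mu))
    return sum(m * v * math.log(v) for v, m in zip(f, mu)) - mean * math.log(mean)

def apply_semigroup(f, mu, t):
    """P_t f = e^{-t} f + (1 - e^{-t}) mu(f), the semigroup of Example 3.21."""
    mean = sum(m * v for v, m in zip(f, mu))
    w = math.exp(-t)
    return [w * v + (1 - w) * mean for v in f]

def cov_mu(f, g, mu):
    """Cov_mu[f, g] = mu(f g) - mu(f) mu(g)."""
    mean_f = sum(m * v for v, m in zip(f, mu))
    mean_g = sum(m * v for v, m in zip(g, mu))
    return sum(m * a * b for a, b, m in zip(f, g, mu)) - mean_f * mean_g

if __name__ == "__main__":
    f, mu = [0.2, 1.0, 3.0, 5.0], [0.1, 0.2, 0.3, 0.4]
    for t in (0.1, 1.0, 3.0):
        print(t, ent_mu(apply_semigroup(f, mu, t), mu), math.exp(-t) * ent_mu(f, mu))
    print(ent_mu(f, mu), cov_mu([math.log(v) for v in f], f, mu))
```

Each printed pair should satisfy the first number being at most the second: the entropy of $P_t f$ lies below $e^{-t}\mathrm{Ent}_\mu[f]$, and $\mathrm{Ent}_\mu[f]$ lies below $\mathrm{Cov}_\mu[\log f, f]$.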

Remark 3.22. We have seen in Example 2.29 that the characterization of
Poincaré inequalities of Theorem 2.18 is sufficiently powerful to reproduce
the tensorization inequality for variance. In contrast, in view of the above
example, we see that Theorem 3.20 cannot reproduce the tensorization
inequality for entropy. Indeed, extending the above example to the setting of
Example 2.29, we can obtain at best an inequality of the form

$$\mathrm{Ent}[f(X_1,\ldots,X_n)] \le \mathbf{E}\Bigl[\sum_{i=1}^n \mathrm{Cov}_i[\log f, f](X_1,\ldots,X_n)\Bigr],$$

which has covariances on the right-hand side instead of entropies (that is,
Theorem 3.20 yields a combination of the tensorization Theorem 3.14 and the
discrete log-Sobolev inequality of Lemma 3.16). Thus the result of Theorem
3.20 is inherently less complete than that of Theorem 2.18. On the other hand,
Theorem 3.20 still provides a powerful tool to prove log-Sobolev inequalities.
This is particularly useful in the continuous case, as we will see presently.

Example 3.23 (Gaussian log-Sobolev inequality). Let us prove a log-Sobolev
inequality for the standard Gaussian distribution $\mu = N(0,1)$ in one
dimension (we will subsequently use tensorization to extend to higher dimensions).
To this end, we will again use the Ornstein-Uhlenbeck process $X_t$ introduced
in Example 2.22. In particular, we recall two important properties of the
Ornstein-Uhlenbeck process that were proved in Example 2.22:

$$\mathcal{E}(f,g) = \mu(f'g'), \qquad (P_t f)' = e^{-t} P_t f'.$$

Using these properties, we will now proceed to prove a log-Sobolev inequality.
Note that $(\log f)' f' = |f'|^2/f$. We therefore have

$$(\log P_t f)'(P_t f)' = e^{-2t}\,\frac{|P_t f'|^2}{P_t f}.$$

By Cauchy-Schwarz, we obtain

$$|P_t f'|^2 = \Big|P_t\Big(\frac{f'}{\sqrt{f}}\,\sqrt{f}\Big)\Big|^2 \le P_t\Big(\frac{|f'|^2}{f}\Big)\,P_t f = P_t((\log f)' f')\,P_t f,$$

and consequently

$$(\log P_t f)'(P_t f)' \le e^{-2t}\,P_t((\log f)' f').$$

Integrating with respect to $\mu$ on both sides yields

$$\mathcal{E}(\log P_t f, P_t f) \le e^{-2t}\,\mathcal{E}(\log f, f),$$

and thus the implication 3 ⇒ 1 of Theorem 3.20 yields

$$\mathrm{Ent}_\mu[f] \le \tfrac{1}{2}\,\mathcal{E}(\log f, f).$$

This is the log-Sobolev inequality for the Gaussian distribution.
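As a sanity check, the one-dimensional inequality $\mathrm{Ent}_\mu[f] \le \frac12\mu(|f'|^2/f)$ can be evaluated numerically. The sketch below uses a plain trapezoid rule on a truncated interval, which is an assumption of this illustration (the truncation to $[-12,12]$ and the two test functions are choices made here). For exponential functions $f(x) = e^{ax}$ both sides equal $\frac{a^2}{2}e^{a^2/2}$, so the inequality holds with equality in that case.

```python
import math

def gauss_expect(g, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoid-rule approximation of E[g(X)] for X ~ N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * g(x) * math.exp(-0.5 * x * x)
    return total * h / math.sqrt(2.0 * math.pi)

def lsi_sides(f, df):
    """Return (Ent[f], (1/2) E[f'^2 / f]) under the standard Gaussian."""
    mean = gauss_expect(f)
    entropy = gauss_expect(lambda x: f(x) * math.log(f(x))) - mean * math.log(mean)
    energy = 0.5 * gauss_expect(lambda x: df(x) ** 2 / f(x))
    return entropy, energy

a = 0.7
# Equality case: f(x) = exp(a x).
exp_case = lsi_sides(lambda x: math.exp(a * x), lambda x: a * math.exp(a * x))
# A non-extremal case: f(x) = 1 + x^2.
poly_case = lsi_sides(lambda x: 1.0 + x * x, lambda x: 2.0 * x)
closed_form = 0.5 * a * a * math.exp(0.5 * a * a)
print(exp_case, closed_form, poly_case)
```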

Having proved the Gaussian log-Sobolev inequality in one dimension, we
immediately obtain an n-dimensional inequality by tensorization.

Theorem 3.24 (Gaussian log-Sobolev inequality). Let $X_1,\ldots,X_n$ be
independent Gaussian random variables with zero mean and unit variance. Then

$$\mathrm{Ent}[f(X_1,\ldots,X_n)] \le \tfrac{1}{2}\,\mathbf{E}[\nabla f(X_1,\ldots,X_n)\cdot\nabla\log f(X_1,\ldots,X_n)]$$

for every $f \ge 0$.

Why is this a log-Sobolev inequality in the sense of the previous section?
Note that, by the chain rule, the inequality of Theorem 3.24 is equivalent to

$$\mathrm{Ent}[e^{f(X_1,\ldots,X_n)}] \le \tfrac{1}{2}\,\mathbf{E}[\|\nabla f(X_1,\ldots,X_n)\|^2\,e^{f(X_1,\ldots,X_n)}]$$

for every $f$. This is precisely the type of inequality that arises in the previous
section. In particular, in this form, it is immediately evident that Theorem 3.24
provides the key ingredient to prove a Gaussian concentration inequality. The
following result is one of the most important properties of Gaussian variables.

Theorem 3.25 (Gaussian concentration). Let $X_1,\ldots,X_n$ be independent
Gaussian random variables with zero mean and unit variance. Then

$$\mathbf{P}[f(X_1,\ldots,X_n) - \mathbf{E}f(X_1,\ldots,X_n) \ge t] \le e^{-t^2/2\sigma^2}$$

for all $t \ge 0$, where $\sigma^2 = \|\|\nabla f\|^2\|_\infty$. In fact, $f(X_1,\ldots,X_n)$ is $\sigma^2$-subgaussian.

Proof. By Theorem 3.24 and the chain rule, we can estimate

$$\mathrm{Ent}[e^{\lambda f(X_1,\ldots,X_n)}] \le \frac{\lambda^2\sigma^2}{2}\,\mathbf{E}[e^{\lambda f(X_1,\ldots,X_n)}]$$

for all $\lambda \in \mathbb{R}$. The result now follows from Lemma 3.13. □
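The tail bound of Theorem 3.25 can be illustrated by simulation. The following is a rough Monte Carlo sketch under assumptions chosen here: the function is the Euclidean norm (whose gradient has norm one away from the origin, so $\sigma^2 = 1$), the dimension is $n = 10$, and the exact mean is replaced by the sample mean. The seed and sample size are arbitrary.

```python
import math
import random

def euclidean_norm(x):
    """A 1-Lipschitz function of the Gaussian vector x."""
    return math.sqrt(sum(v * v for v in x))

def sample_values(f, n, samples=20000, seed=0):
    """Evaluate f on independent standard Gaussian vectors in R^n."""
    rng = random.Random(seed)
    return [f([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(samples)]

values = sample_values(euclidean_norm, n=10)
mean = sum(values) / len(values)  # empirical stand-in for E f(X)

# Empirical upper tail P[f - E f >= t] next to the bound exp(-t^2 / 2).
levels = (0.5, 1.0, 1.5, 2.0)
tails = {t: sum(1 for v in values if v - mean >= t) / len(values) for t in levels}
bounds = {t: math.exp(-t * t / 2.0) for t in levels}
for t in levels:
    print(t, tails[t], bounds[t])
```

The bound does not depend on the dimension $n$; rerunning the sketch with a larger `n` leaves the empirical tails essentially unchanged.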


Remark 3.26. In the Gaussian case, we have seen several different forms of the
log-Sobolev inequality. Besides the form as stated in Theorem 3.24

$$\mathrm{Ent}[f] \le \tfrac{1}{2}\,\mathbf{E}[\nabla f\cdot\nabla\log f] = \tfrac{1}{2}\,\mathcal{E}(\log f, f)$$

(which corresponds to the inequality in Theorem 3.20), we can write

$$\mathrm{Ent}[f] \le \tfrac{1}{2}\,\mathbf{E}\Big[\frac{\|\nabla f\|^2}{f}\Big]$$

(which is in fact the form that was used in the proof of Theorem 3.24), or

$$\mathrm{Ent}[e^f] \le \tfrac{1}{2}\,\mathbf{E}[\|\nabla f\|^2 e^f]$$

(which was used in the proof of Theorem 3.25). Another equivalent form is

$$\mathrm{Ent}[f^2] \le 2\,\mathbf{E}[\|\nabla f\|^2] = 2\,\mathcal{E}(f,f).$$

The latter form is in fact the “classical” form of the log-Sobolev inequality as
it is found in the analysis literature. For the Gaussian case, all these forms
of the log-Sobolev inequality are equivalent due to the fact that the Dirichlet
form is given in terms of a gradient that satisfies the chain rule. This is not the
case in general, however: for many Markov processes (such as in Remark 3.21)
the Dirichlet form does not satisfy the chain rule property, and in this case the
above inequalities are typically not equivalent to one another. Nonetheless, it
is often possible to deduce useful forms of these inequalities even in the absence
of the chain rule, as we did, for example, in the proof of Lemma 3.16. We will
loosely refer to all inequalities of this kind as “log-Sobolev inequalities.”

Remark 3.27. The “classical” form of the log-Sobolev inequality reads

$$\mathbf{E}[f^2\log f] - \mathbf{E}[f^2]\log\|f\|_2 \le c\,\|\nabla f\|_2^2,$$

while the Poincaré inequality reads

$$\mathbf{E}[f^2] - \mathbf{E}[f]^2 \le c\,\|\nabla f\|_2^2.$$

When viewed in this manner, the log-Sobolev inequality looks only slightly
stronger than the Poincaré inequality: the latter controls the $L^2$-norm of a
function by the $L^2$-norm of its gradient, while the former controls the function
in a slightly stronger (by a logarithmic factor) $L^2\log L$-norm.$^1$ As we have
seen, this apparently minor improvement has far-reaching consequences.

In classical analysis, an important role is played by Sobolev inequalities
that have the form $\|f - \mathbf{E}f\|_q \le c\,\|\nabla f\|_2$ for $q > 2$. Such inequalities are even
better than log-Sobolev inequalities: they ensure that the $L^q$-norm of a function
is controlled by the $L^2$-norm of its gradient, while log-Sobolev inequalities only
improve over $L^2$ by a logarithmic factor (hence the name). However, unlike
log-Sobolev inequalities, classical Sobolev inequalities do not tensorize. It is
for this reason that log-Sobolev inequalities are much more important than
classical Sobolev inequalities in high-dimensional probability.

In  view  of  the  previous  remark,  it  is  natural  to  conclude  that  log-Sobolev 
inequalities are strictly stronger than Poincaré inequalities, but this is not
entirely  obvious.  We  conclude  this  section  by  showing  that  this  is  indeed 
the  case,  even  in  the  more  general  setting  of  Theorem  3.20.  This  clarifies,  in 
particular,  that  the  methods  developed  in  this  chapter  to  prove  concentration 
inequalities  can  be  viewed  in  a  precise  sense  as  direct  extensions  of  the  theory 
developed  in  the  previous  chapter  to  prove  variance  bounds. 

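Lemma 3.28. The log-Sobolev inequality $\mathrm{Ent}[f] \le c\,\mathcal{E}(\log f, f)$ for all $f \ge 0$
implies the Poincaré inequality $\mathrm{Var}[f] \le 2c\,\mathcal{E}(f,f)$ for all $f$.

Proof. The log-Sobolev inequality applied to $e^{\lambda f}$ states for $\lambda \ge 0$

$$\mathbf{E}[\lambda f e^{\lambda f}] - \mathbf{E}[e^{\lambda f}]\log\mathbf{E}[e^{\lambda f}] \le c\,\mathcal{E}(\lambda f, e^{\lambda f}).$$

As $\mathcal{E}(f,1) = 0$, we can estimate

$$\mathcal{E}(\lambda f, e^{\lambda f}) = \lambda^2\,\mathcal{E}(f,f) + o(\lambda^2),$$

while we have

$$\mathbf{E}[\lambda f e^{\lambda f}] = \lambda\,\mathbf{E}[f] + \lambda^2\,\mathbf{E}[f^2] + o(\lambda^2),$$

and

$$\mathbf{E}[e^{\lambda f}]\log\mathbf{E}[e^{\lambda f}] = \lambda\,\mathbf{E}[f] + \lambda^2\{\mathbf{E}[f^2] + \mathbf{E}[f]^2\}/2 + o(\lambda^2).$$

Thus we obtain the Poincaré inequality $\mathrm{Var}[f] \le 2c\,\mathcal{E}(f,f)$ by dividing the
log-Sobolev inequality $\mathrm{Ent}[e^{\lambda f}] \le c\,\mathcal{E}(\lambda f, e^{\lambda f})$ by $\lambda^2$ and letting $\lambda \downarrow 0$. □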
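The expansions in this proof combine to give $\mathrm{Ent}[e^{\lambda f}] = \frac{\lambda^2}{2}\mathrm{Var}[f] + o(\lambda^2)$. The following sketch checks this limit numerically on a three-point space; the measure `mu` and the function `f` are arbitrary illustrative choices, not part of the notes.

```python
import math

# Arbitrary finite probability space and function (assumptions of this sketch).
mu = [0.2, 0.5, 0.3]
f = [-1.0, 0.5, 2.0]

def expect(g):
    """Integral of g against mu."""
    return sum(m * v for m, v in zip(mu, g))

def entropy_of_exp(lam):
    """Ent[e^{lam f}] = E[lam f e^{lam f}] - E[e^{lam f}] log E[e^{lam f}]."""
    e = [math.exp(lam * v) for v in f]
    mean = expect(e)
    return expect([lam * v * w for v, w in zip(f, e)]) - mean * math.log(mean)

variance = expect([v * v for v in f]) - expect(f) ** 2

# Ent[e^{lam f}] / lam^2 should approach Var[f] / 2 as lam decreases to 0.
ratios = {lam: entropy_of_exp(lam) / lam ** 2 for lam in (1.0, 0.1, 0.01, 0.001)}
for lam, ratio in ratios.items():
    print(lam, ratio, variance / 2.0)
```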


Problems 

3.17 (Relative entropy convergence). As Theorem 3.20 does not require
$P_t$ to be reversible, the log-Sobolev inequality $\mathrm{Ent}_\mu[f] \le c\,\mathcal{E}(\log f, f)$ is not
necessarily equivalent to the reverse inequality $\mathrm{Ent}_\mu[f] \le c\,\mathcal{E}(f, \log f)$. There
is, however, a dual form of Theorem 3.20 that will yield the latter.

1 While the idea expressed here is intuitive, it should be noted that entropy is not
a norm. However, the statement can be made precise in terms of Orlicz norms.

Define the relative entropy between probability measures $\nu$ and $\mu$ as

$$D(\nu\|\mu) := \mathrm{Ent}_\mu\Big[\frac{d\nu}{d\mu}\Big] \quad\text{for } \nu \ll \mu,$$

and $D(\nu\|\mu) := \infty$ otherwise. The relative entropy should be viewed as a
notion of “distance” between probability measures: in particular $D(\nu\|\mu) \ge 0$
and $D(\nu\|\mu) = 0$ if and only if $\mu = \nu$. Note, however, that $D(\nu\|\mu)$ is not a
metric (it is neither symmetric, nor does it satisfy a triangle inequality). The
relative entropy will play an important role in the next chapter.

For every probability measure $\nu$, we can define the probability measure $\nu P_t$
by setting $(\nu P_t)f = \nu(P_t f)$ for every function $f$. Note that $\nu P_t$ is precisely the
law of $X_t$ given that the initial condition $X_0$ is drawn from $\nu$. Indeed, if $X_0 \sim \nu$,
then $\nu P_t f = \mathbf{E}[P_t f(X_0)] = \mathbf{E}[\mathbf{E}[f(X_t)|X_0]] = \mathbf{E}[f(X_t)]$. In particular, the
stationary measure $\mu$ satisfies, by its definition, $\mu P_t = \mu$ for all $t$.

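a. Let $h = \frac{d\nu}{d\mu}$. Show that $D(\nu P_t\|\mu) = \mathrm{Ent}_\mu[P_t^* h]$, where $P_t^*$ is the adjoint of
the semigroup $P_t$ (that is, $\langle f, P_t g\rangle_\mu = \langle P_t^* f, g\rangle_\mu$ for all $f, g$).

b. Show that the log-Sobolev inequality

$$\mathrm{Ent}_\mu[f] \le c\,\mathcal{E}(f, \log f) \quad\text{for all } f$$

holds if and only if $P_t$ is exponentially ergodic in relative entropy:

$$D(\nu P_t\|\mu) \le e^{-t/c}\,D(\nu\|\mu) \quad\text{for all } t, \nu.$$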

3.18  (Norms  of  Gaussian  vectors).  The  goal  of  this  problem  is  to  prove 
some  classical  results  about  norms  of  Gaussian  vectors.  We  begin  with  a  simple 
but  important  consequence  of  Gaussian  concentration. 

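a. Let $X \sim N(0,\Sigma)$ be an n-dimensional centered Gaussian vector with
arbitrary covariance matrix $\Sigma$. Prove that (see Problem 2.8 for a hint)

$$\max_{i=1,\ldots,n} X_i \ \text{ is } \ \tau^2 := \max_{i=1,\ldots,n}\mathrm{Var}[X_i]\text{-subgaussian}.$$

b. Show that the mean and median of $\max_i X_i$ satisfy

$$\mathbf{E}\Big[\max_i X_i\Big] \le \mathrm{med}\Big[\max_i X_i\Big] + \tau\sqrt{2\log 2}.$$

Hint: estimate $\mathbf{P}[\max_i X_i \ge \mathbf{E}[\max_i X_i] - t]$ from below for $t = \tau\sqrt{2\log 2}$.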

Let $(B, \|\cdot\|_B)$ be a separable Banach space, and let $X$ be a centered Gaussian
vector in $B$ (that is, $X \in B$ and $\langle v, X\rangle$ is a Gaussian random variable for
every element $v \in B^*$ in the dual space of $B$). Recall that the norm satisfies

$$\|x\|_B = \sup_{v \in B^*,\ \|v\| \le 1} \langle v, x\rangle$$

by duality. Moreover, as $B$ is separable, the supremum in this expression can
be restricted to a countable dense subset $V \subset B^*$ (independent of $x$). Define

$$\sigma^2 := \sup_{v \in V}\mathbf{E}[\langle v, X\rangle^2].$$

c. Show that $\sigma < \infty$, $\mathbf{E}\|X\|_B < \infty$, and that $\|X\|_B$ is $\sigma^2$-subgaussian.

Hint: $\mathrm{med}[|\langle v, X\rangle|] \le \mathrm{med}[\|X\|_B] < \infty$ for all $v \in B^*$, $\|v\| \le 1$.

d. Prove the Landau-Shepp-Marcus-Fernique theorem:

$$\mathbf{E}[e^{\alpha\|X\|_B^2}] < \infty \quad\text{if and only if}\quad \alpha < \frac{1}{2\sigma^2}.$$

Hint: for the only if part, use $\mathbf{E}[e^{\alpha\|X\|_B^2}] \ge \mathbf{E}[e^{\alpha\langle v, X\rangle^2}]$ for $v \in B^*$, $\|v\| \le 1$.

3.19 (Bakry-Émery criterion). In Problems 2.12 and 2.13 (we adopt the
notation used there), we showed that the Bakry-Émery criterion $c\,\Gamma_2(f,f) \ge
\Gamma(f,f)$ provides an algebraic criterion for the validity of the Poincaré
inequality. However, the Bakry-Émery criterion is strictly stronger than the validity
of a Poincaré inequality. In the present problem, we will show that if the
Markov semigroup is reversible and its carré du champ satisfies a chain rule,
then the Bakry-Émery criterion even implies validity of the log-Sobolev
inequality. This provides a very useful tool for proving log-Sobolev inequalities
for certain classes of continuous distributions.

Let $P_t$ be a reversible and ergodic Markov semigroup with stationary
measure $\mu$, and assume that the carré du champ satisfies the chain rule

$$\Gamma(f, \varphi\circ g) = \Gamma(f,g)\,\varphi'\circ g.$$

For example, this is evidently the case when $\Gamma(f,g) = \nabla f\cdot\nabla g$.

a. Show that

$$\mathcal{E}(\log P_t f, P_t f) = \mu(\Gamma(P_t\log P_t f, f))
\le \mu(\Gamma(f,f)/f)^{1/2}\,\mu(f\,\Gamma(P_t\log P_t f, P_t\log P_t f))^{1/2}.$$

b. Show that the Bakry-Émery criterion $c\,\Gamma_2(f,f) \ge \Gamma(f,f)$ for all $f$ implies

$$\mathcal{E}(\log P_t f, P_t f) \le e^{-t/c}\,\mathcal{E}(\log f, f)^{1/2}\,\mu(f\,P_t\Gamma(\log P_t f, \log P_t f))^{1/2}.$$

Hint: use Theorem 2.37 and the chain rule.

c. Show that the above inequality implies

$$\mathcal{E}(\log P_t f, P_t f) \le e^{-t/c}\,\mathcal{E}(\log f, f)^{1/2}\,\mathcal{E}(\log P_t f, P_t f)^{1/2},$$

and thus the Bakry-Émery criterion implies the log-Sobolev inequality

$$\mathrm{Ent}_\mu[f] \le \frac{c}{2}\,\mathcal{E}(\log f, f) \quad\text{for all } f.$$

d. Let $\mu$ be a $\rho$-uniformly log-concave probability measure on $\mathbb{R}^n$, that is,
$\mu(dx) = e^{-W(x)}dx$ where the potential function $W$ satisfies $\nabla\nabla^* W \succeq \rho\,\mathrm{Id}$.
Show that $\mu$ satisfies the dimension-free log-Sobolev inequality

$$\mathrm{Ent}_\mu[f^2] \le \frac{2}{\rho}\int\|\nabla f\|^2\,d\mu.$$

Hint: see Problem 2.13.

Remark. In the setting of this problem, it is in fact possible after some further
work to show that the Bakry-Émery criterion is equivalent to the validity of
a local log-Sobolev inequality, which strengthens the result of Theorem 2.37
under the chain rule assumption. We omit the details.

3.20 (Bounded perturbations). Let $\mu$ be a probability measure for which
we have proved a log-Sobolev inequality. Let $\nu$ be a “small perturbation” of
$\mu$. It is not entirely obvious that $\nu$ will also satisfy a log-Sobolev inequality.
In this problem, we will show that log-Sobolev and Poincaré inequalities are
stable under bounded perturbations, so that we can deduce an inequality for
$\nu$ from the corresponding inequality for $\mu$. This can be a useful tool to prove
log-Sobolev or Poincaré inequalities in cases for which it is not obvious how
to proceed by a direct approach (for example, using Theorem 3.20).

Suppose that $\mu$ satisfies the log-Sobolev inequality

$$\mathrm{Ent}_\mu[f] \le c\,\mu(\Gamma(f, \log f)),$$

where we have expressed the right-hand side in terms of a “square gradient”
$\Gamma(f, \log f) \ge 0$. For example, if $\mu \sim N(0, I)$, we choose $\Gamma(f, \log f) = \nabla f\cdot\nabla\log f$.
In the setting of Theorem 3.20, if the Markov semigroup is reversible, we
can choose $\Gamma(f, \log f)$ to be the carré du champ of Problem 2.7; however,
the present result is not specific to the Markov semigroup setting and can be
applied to any log-Sobolev type inequality of the above form.

a. Prove the following inequality for $\nu \ll \mu$:

$$\mathrm{Ent}_\nu[X] \le \Big\|\frac{d\nu}{d\mu}\Big\|_\infty\,\mathrm{Ent}_\mu[X].$$

Hint: use the variational principle of Problem 3.13.

b. Suppose that $\nu$ is a bounded perturbation of $\mu$ in the sense that $\varepsilon \le \frac{d\nu}{d\mu} \le \delta$
for some $\delta, \varepsilon > 0$. Show that $\nu$ satisfies the log-Sobolev inequality

$$\mathrm{Ent}_\nu[f] \le \frac{c\,\delta}{\varepsilon}\,\nu(\Gamma(f, \log f)).$$

c. Define the probability measure $\nu(dx) = Z^{-1}e^{-V(x)}dx$ on $\mathbb{R}$, where $Z$ is
the normalization factor. Suppose that the potential $V(x)$ is sandwiched
between two quadratic functions: $x^2 + a \le V(x) \le x^2 + b$ for all $x \in \mathbb{R}$.
Show that $\nu$ satisfies the log-Sobolev inequality

$$\mathrm{Ent}_\nu[f^2] \le e^{b-a}\,\nu(|f'|^2).$$

d. We have shown that the log-Sobolev inequality is stable under bounded
perturbations. An analogous result holds for Poincaré inequalities. Indeed,
suppose that $\mu$ satisfies the Poincaré inequality

$$\mathrm{Var}_\mu[f] \le c\,\mu(\Gamma(f,f)).$$

Show that if $\varepsilon \le \frac{d\nu}{d\mu} \le \delta$, then

$$\mathrm{Var}_\nu[f] \le \frac{c\,\delta}{\varepsilon}\,\nu(\Gamma(f,f)).$$

Remark. While bounded perturbation results can be useful, the constant $\delta/\varepsilon$
can be quite large in practice. In particular, it is typically the case that $\delta/\varepsilon$
will increase exponentially with dimension, so that the bounded perturbation
method does not yield satisfactory results when applied in high dimension.
However, one can of course apply the bounded perturbation method in one
dimension, and then obtain dimension-free results by tensorization.


Notes 

§3.1  and  §3.2.  Much  of  this  material  is  classical.  See,  e.g.,  [13,  26]  for  a  more 
systematic  treatment  of  subgaussian  inequalities  and  the  martingale  method. 
Theorem 3.11 was popularized by McDiarmid [58] for combinatorial problems.

§3.3  and  §3.4.  Logarithmic  Sobolev  inequalities  were  first  systematically 
studied  by  Gross  [42],  together  with  their  connection  to  Markov  semigroups. 
A  comprehensive  treatment  is  given  in  [44]  and  in  [7]  (see  also  [12]  where 
such  connections  are  developed  in  the  discrete  setting).  The  tensorization 
property  of  entropy  also  appears  already  in  [42];  we  followed  the  proof  in 
[50].  The  variational  formula  for  entropy  plays  a  fundamental  role  in  large 
deviations  theory  [22].  Lemma  3.13  is  due  to  I.  Herbst,  but  was  apparently 
never  published  by  him.  The  entropy  method  was  systematically  applied  to  the 
development  of  concentration  inequalities  by  Ledoux  [48,  50] .  A  comprehensive 
treatment  of  the  entropy  method  for  concentration  inequalities  is  given  in  [13]. 
Problem 3.16 is from [11], while Problem 3.18 follows the approach in [49].


Lipschitz concentration and transportation inequalities


In the previous chapters, we have investigated the concentration phenomenon
in the following form: the fluctuations of a function $f(X_1,\ldots,X_n)$ of
independent (or weakly dependent) random variables are small if the “gradient” of $f$
is small. In this chapter, we will develop a different perspective on the
concentration phenomenon. Rather than measuring the sensitivity of the function $f$
in terms of a gradient, we will introduce a metric viewpoint that emphasizes
the role of Lipschitz functions. This complementary perspective will lead us to
new methods to investigate and prove concentration, and to new inequalities
that do not have a natural description in terms of gradients. In particular, we
will prove Talagrand’s inequality, which is important in many applications.


4.1  Concentration  in  metric  spaces 

Recall  a  basic  definition. 

Definition 4.1 (Lipschitz functions). Let $(\mathbb{X}, d)$ be a metric space. A
function $f : \mathbb{X} \to \mathbb{R}$ is called L-Lipschitz if $|f(x) - f(y)| \le L\,d(x,y)$ for all $x, y \in \mathbb{X}$.
The family of all 1-Lipschitz functions is denoted $\mathrm{Lip}(\mathbb{X})$.

What  do  Lipschitz  functions  have  to  do  with  concentration?  While  we  have 
expressed  our  concentration  results  to  date  in  terms  of  gradient  bounds,  such 
results  can  often  be  interpreted  naturally  in  terms  of  Lipschitz  properties.  To 
make  this  point,  let  us  begin  by  considering  two  examples. 

Example 4.2 (Gaussian concentration). Let $X_1,\ldots,X_n$ be i.i.d. $N(0,1)$
random variables. Gaussian concentration (Theorem 3.25) states that the
random variable $f(X_1,\ldots,X_n)$ is $\|\|\nabla f\|^2\|_\infty$-subgaussian. However, the quantity
$\|\|\nabla f\|^2\|_\infty$ is naturally expressed in terms of a Lipschitz property.

Lemma 4.3. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a $C^1$-function. Then $\|\|\nabla f\|^2\|_\infty \le L^2$ if and
only if $|f(x) - f(y)| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^n$.

Proof. Note that the L-Lipschitz property implies

$$v\cdot\nabla f(x) = \lim_{t\downarrow 0}\frac{f(x + tv) - f(x)}{t} \le L\|v\|.$$

Optimizing over $\|v\| \le 1$ and $x$ yields $\|\|\nabla f\|^2\|_\infty \le L^2$. Conversely,

$$f(x) - f(y) = \int_0^1 \frac{d}{dt}f(tx + (1-t)y)\,dt = \int_0^1 (x - y)\cdot\nabla f(tx + (1-t)y)\,dt$$

by the fundamental theorem of calculus. It therefore follows readily that if
$\|\|\nabla f\|^2\|_\infty \le L^2$, then $f(x) - f(y) \le L\|x - y\|$ for all $x, y \in \mathbb{R}^n$. □


In view of this lemma, it follows immediately$^1$ that Gaussian concentration
can be equivalently phrased in terms of Lipschitz functions: if $X \sim N(0, I)$,
then $f(X)$ is 1-subgaussian for every $f \in \mathrm{Lip}(\mathbb{R}^n, \|\cdot\|)$.

As  a  second  example,  let  us  revisit  McDiarmid’s  inequality. 


Example 4.4 (McDiarmid’s inequality). Let $X_1,\ldots,X_n$ be independent
random variables, where $X_i$ takes values in some measurable space $\mathbb{X}_i$ for
$i = 1,\ldots,n$. McDiarmid’s inequality (Theorem 3.11) states that the random
variable $f(X_1,\ldots,X_n)$ is $\frac{1}{4}\sum_{k=1}^n\|D_k f\|_\infty^2$-subgaussian. This inequality
can also be phrased in terms of a Lipschitz property. To this end, let us introduce
the weighted Hamming distance $d_c(x,y)$ on $\mathbb{X}_1\times\cdots\times\mathbb{X}_n$ as

$$d_c(x,y) := \sum_{i=1}^n c_i\,\mathbf{1}_{x_i \ne y_i}.$$

Lemma 4.5. Let $f : \mathbb{X}_1\times\cdots\times\mathbb{X}_n \to \mathbb{R}$. Then $\|D_i f\|_\infty \le c_i$ for all $i = 1,\ldots,n$
if and only if $|f(x) - f(y)| \le d_c(x,y)$ for all $x, y \in \mathbb{X}_1\times\cdots\times\mathbb{X}_n$.

Proof. Suppose that $f$ is 1-Lipschitz with respect to $d_c$. If $x, y$ only differ in
the $i$th coordinate, it follows that $|f(x) - f(y)| \le c_i$. In particular, we conclude
that $\|D_i f\|_\infty \le c_i$ for all $i$. Conversely, consider the telescoping sum

$$f(x) - f(y) = \sum_{i=1}^n \{f(x_1,\ldots,x_i,y_{i+1},\ldots,y_n) - f(x_1,\ldots,x_{i-1},y_i,\ldots,y_n)\}.$$

As the $i$th term in the sum is the difference between $f$ evaluated at two points
that differ only in the $i$th coordinate, it is bounded by $\|D_i f\|_\infty\mathbf{1}_{x_i \ne y_i}$. Thus
if $\|D_i f\|_\infty \le c_i$ for all $i$, then $f$ is 1-Lipschitz with respect to $d_c$. □

In view of this simple observation, we obtain the following equivalent
formulation of McDiarmid’s inequality: if $X$ is a random vector with independent
entries, $f(X)$ is $\frac{1}{4}\|c\|^2$-subgaussian for every $f \in \mathrm{Lip}(\mathbb{X}_1\times\cdots\times\mathbb{X}_n, d_c)$.
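The equivalence of Lemma 4.5 can be verified by brute force on a small product space. The sketch below works on $\{0,1\}^4$, where $\|D_i f\|_\infty$ is simply the largest change of $f$ when coordinate $i$ is flipped; the test function `f` is a hypothetical example chosen here, and the helper names are not from the notes.

```python
import itertools

def weighted_hamming(c, x, y):
    """d_c(x, y) = sum of c_i over the coordinates where x and y differ."""
    return sum(ci for ci, xi, yi in zip(c, x, y) if xi != yi)

def discrete_derivative_norms(f, n):
    """||D_i f||_inf on {0,1}^n: largest change of f when only coordinate i flips."""
    norms = []
    for i in range(n):
        worst = 0.0
        for x in itertools.product((0, 1), repeat=n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            worst = max(worst, abs(f(x) - f(flipped)))
        norms.append(worst)
    return norms

def is_one_lipschitz(f, c, n):
    """Check |f(x) - f(y)| <= d_c(x, y) over all pairs of points."""
    points = list(itertools.product((0, 1), repeat=n))
    return all(
        abs(f(x) - f(y)) <= weighted_hamming(c, x, y) + 1e-12
        for x in points
        for y in points
    )

def f(x):
    """Hypothetical test function on {0,1}^4 (an assumption of this sketch)."""
    return max(0.5 * x[0] + 2.0 * x[1], 1.5 * x[2]) + 0.25 * x[3]

n = 4
c = discrete_derivative_norms(f, n)          # smallest admissible weights
lipschitz_with_c = is_one_lipschitz(f, c, n)
lipschitz_with_half_c = is_one_lipschitz(f, [0.5 * ci for ci in c], n)
print(c, lipschitz_with_c, lipschitz_with_half_c)
```

Taking $c_i = \|D_i f\|_\infty$ gives the smallest weights for which $f$ is 1-Lipschitz, which is why halving them breaks the Lipschitz property in this example.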

1 The claim holds even when $f$ is not $C^1$ by a simple approximation argument: any
Lipschitz function can be approximated uniformly by a smooth Lipschitz function
by convolving with a smooth density. The details are left as an exercise.

At an informal level, we have introduced the general concentration
principle by stating that a function $f(X_1,\ldots,X_n)$ of independent or weakly
dependent random variables is close to its mean if the function $f$ is “not too
sensitive” to any of its coordinates. Gradient bounds and Lipschitz properties
provide two different ways of making the informal notion of “not too
sensitive” precise. In the case of gradient bounds, the sensitivity of the function
$f$ is measured locally, while the Lipschitz property quantifies the sensitivity
in a global manner. These two points of view are very similar in spirit,
however, and are often even equivalent as we have seen above in the case of the
Gaussian concentration inequality and McDiarmid’s inequality.

Nonetheless, it will prove to be extremely useful to reconsider the
concentration principle from the metric perspective. The reasons for this are twofold:

• While in some cases gradient bounds and Lipschitz properties can be shown
to be equivalent, there are other cases in which these two notions are
distinct. For example, the one-sided difference bound $\|\sum_{i=1}^n |D_i^- f|^2\|_\infty \le L^2$
is not naturally formulated in terms of a Lipschitz property with respect
to some metric. Conversely, there are important Lipschitz-type
properties that cannot be naturally formulated in terms of a gradient; we will
encounter such a property when we develop Talagrand’s concentration
inequalities later in this chapter. Thus the complementary viewpoints
provided by gradient and metric notions of concentration give rise to genuinely
different results that can be of substantial importance in different settings.

•  Our  emphasis  on  gradients  in  the  previous  chapters  was  intimately  tied 
to a class of inequalities (Poincaré and log-Sobolev inequalities) that are
of  fundamental  importance  in  proving  and  understanding  concentration 
properties.  The  metric  perspective,  however,  will  require  us  to  develop  new 
types  of  inequalities  that  exploit  the  metric  structure  of  the  problem.  The 
development  of  these  ideas  will  significantly  enhance  our  understanding  of 
the  concentration  principle  and  will  provide  us  with  new  tools  to  prove 
concentration  inequalities  that  are  not  easily  obtained  by  other  methods. 

Having  roughly  motivated  the  metric  perspective  on  concentration,  we  are 
ready  to  take  some  first  steps  towards  a  general  theory. 

We have shown above that Gaussian concentration can be phrased as
follows: if $X \sim N(0, I)$, then $f(X)$ is 1-subgaussian for every $f \in \mathrm{Lip}(\mathbb{R}^n, \|\cdot\|)$.
Similarly, McDiarmid’s inequality states that if $X$ is a random vector with
independent entries, $f(X)$ is $\frac{1}{4}\|c\|^2$-subgaussian for $f \in \mathrm{Lip}(\mathbb{X}_1\times\cdots\times\mathbb{X}_n, d_c)$.
Motivated by these examples, we can pose the following basic question.

For which probability measures $\mu$ on the metric space $(\mathbb{X}, d)$ is it true
that if $X \sim \mu$, then $f(X)$ is $\sigma^2$-subgaussian for every $f \in \mathrm{Lip}(\mathbb{X})$?

We presently give a very general answer to this question in terms of a new
class of inequalities that will play a fundamental role throughout this chapter.

Definition 4.6 (Wasserstein distance). The Wasserstein distance between
probability measures $\mu, \nu \in \mathcal{P}_1(\mathbb{X}) := \{\rho : \int d(x,\cdot)\,\rho(dx) < \infty\}$ is defined as$^2$

$$W_1(\mu,\nu) := \sup_{f \in \mathrm{Lip}(\mathbb{X})}\Big|\int f\,d\mu - \int f\,d\nu\Big|.$$

Definition 4.7 (Relative entropy). The relative entropy between
probability measures $\nu$ and $\mu$ on any measurable space is defined as

$$D(\nu\|\mu) := \begin{cases} \mathrm{Ent}_\mu\Big[\dfrac{d\nu}{d\mu}\Big] & \text{if } \nu \ll \mu, \\ \infty & \text{otherwise.} \end{cases}$$


Theorem 4.8 (Bobkov-Götze). Let $\mu \in \mathcal{P}_1(\mathbb{X})$ be a probability measure on
a metric space $(\mathbb{X}, d)$. Then the following are equivalent for $X \sim \mu$:

1. $f(X)$ is $\sigma^2$-subgaussian for every $f \in \mathrm{Lip}(\mathbb{X})$.

2. $W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu\|\mu)}$ for all $\nu$.

How should we interpret these concepts? Both the Wasserstein distance
and the relative entropy define a form of distance between probability
measures. The Wasserstein distance defines a metric in terms of expectations of
Lipschitz functions. Relative entropy, on the other hand, is not a metric: it is
not even symmetric and does not satisfy a triangle inequality. Nonetheless, it
is a natural measure of “closeness” between probability measures (for
example, $D(\nu\|\mu) \ge 0$ and $D(\nu\|\mu) = 0$ if and only if $\mu = \nu$). As we will see in the
proof of Theorem 4.8, relative entropy should be viewed as controlling moment
generating functions in a suitable sense. As these two notions of distance are
of an entirely different nature, there is no a priori reason why relative entropy
and Wasserstein distance to a given measure $\mu$ should be comparable, and this
is indeed not necessarily true for arbitrary $\mu$. Theorem 4.8 states that
relative entropy and Wasserstein distance are comparable precisely when one can
control the moment generating functions of Lipschitz functions. Inequalities
such as $W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu\|\mu)}$ therefore play a role in the “metric” setting
analogous to log-Sobolev inequalities in the “gradient” setting. We can
informally view this inequality as a type of dual to the log-Sobolev inequality that
is stated in terms of measures rather than functions (cf. Problem 4.1 below).

Before  we  turn  to  the  proof  of  Theorem  4.8,  let  us  illustrate  how  it  can  be 
used  to  prove  a  well-known  inequality  for  relative  entropy. 


2 Note that $\mu \in \mathcal{P}_1(\mathbb{X})$ if and only if $\int |f|\,d\mu < \infty$ for every $f \in \mathrm{Lip}(\mathbb{X})$.

Example 4.9 (Pinsker’s inequality). Let $d(x,y) := \mathbf{1}_{x \ne y}$ be the trivial metric.
Then $f \in \mathrm{Lip}(\mathbb{X})$ if and only if $\sup f - \inf f \le 1$. Thus the Wasserstein distance
in this case is none other than the total variation distance

$$W_1(\mu,\nu) = \sup_{0 \le f \le 1}\Big|\int f\,d\mu - \int f\,d\nu\Big| = \|\mu - \nu\|_{\mathrm{TV}}$$

(note that the quantity inside the supremum is invariant under adding a
constant to $f$, so there is no loss in restricting to $0 \le f \le 1$ only).

Now recall from Hoeffding’s Lemma 3.6 that $f(X)$ is $\frac{1}{4}\{\sup f - \inf f\}^2$-subgaussian
for every $f$ and $\mu$. Thus Theorem 4.8 implies that

$$\|\mu - \nu\|_{\mathrm{TV}} \le \sqrt{\tfrac{1}{2}D(\nu\|\mu)}$$

for every $\mu, \nu$. This extremely useful result is known as Pinsker’s inequality
(which also provides additional intuition for the fact that $D(\nu\|\mu)$ can be
viewed as a form of “closeness” between probability measures). Of course, we
could have also gone in the converse direction: if we had given an independent
proof of Pinsker’s inequality (there are numerous such proofs), then we could
have used Theorem 4.8 to provide an alternative proof of Hoeffding’s lemma.
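Pinsker’s inequality is easy to test numerically on a finite space, where the total variation distance in the normalization of this example equals $\frac12\sum_x|\mu(x) - \nu(x)|$. The sketch below draws random pairs of distributions on five points; the seed, the number of pairs, and the way the distributions are generated are arbitrary choices made for illustration.

```python
import math
import random

def total_variation(mu, nu):
    """sup over 0 <= f <= 1 of |mu(f) - nu(f)|, i.e. half the L1 distance."""
    return 0.5 * sum(abs(p - q) for p, q in zip(mu, nu))

def rel_entropy(nu, mu):
    """D(nu || mu) in nats on a finite space where mu charges every point."""
    return sum(p * math.log(p / q) for p, q in zip(nu, mu) if p > 0)

def random_distribution(rng, size):
    """A random probability vector with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
pairs = [(random_distribution(rng, 5), random_distribution(rng, 5)) for _ in range(1000)]

# Slack sqrt(D(nu || mu) / 2) - ||mu - nu||_TV, which Pinsker says is nonnegative.
gaps = [
    math.sqrt(0.5 * max(rel_entropy(nu, mu), 0.0)) - total_variation(mu, nu)
    for mu, nu in pairs
]
print(min(gaps), max(gaps))
```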

Let  us  now  turn  to  the  proof  of  Theorem  4.8.  The  key  insight  that  is  needed 
is  that  relative  entropy  is  intimately  related  to  moment  generating  functions; 
once  this  has  been  understood,  the  remainder  of  the  proof  of  Theorem  4.8 
is  essentially  trivial.  The  following  result,  which  dates  back  to  the  earliest 
history  of  statistical  mechanics,  makes  this  idea  precise. 

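Lemma 4.10 (Gibbs variational principle).

$$\log\mathbf{E}_\mu[e^f] = \sup_\nu\{\mathbf{E}_\nu[f] - D(\nu\|\mu)\}.$$

Proof. We may assume $f$ is bounded above to avoid integrability problems (if
not, apply the result to $f \wedge M$ and then take the supremum over $M$). Define
the probability measure $\tilde\mu$ by

$$d\tilde\mu = \frac{e^f\,d\mu}{\mathbf{E}_\mu[e^f]}.$$

We have for $D(\nu\|\mu) < \infty$

$$\log\mathbf{E}_\mu[e^f] - D(\nu\|\tilde\mu) = \log\mathbf{E}_\mu[e^f] - \int\log\Big(\frac{d\nu}{d\tilde\mu}\Big)\,d\nu
= \log\mathbf{E}_\mu[e^f] - \int\log\Big(\frac{\mathbf{E}_\mu[e^f]}{e^f}\,\frac{d\nu}{d\mu}\Big)\,d\nu
= \mathbf{E}_\nu[f] - D(\nu\|\mu).$$

As $D(\nu\|\tilde\mu) \ge 0$ with equality for $\nu = \tilde\mu$, taking the supremum over $\nu$ on
both sides yields the result.

□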
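The variational principle can be checked directly on a finite space: the tilted measure $d\tilde\mu \propto e^f d\mu$ from the proof attains the supremum, and every other $\nu$ gives a smaller value. The data below (`mu`, `f`, and the list of competitor measures) are arbitrary illustrative choices.

```python
import math

def rel_entropy(nu, mu):
    """D(nu || mu) on a finite space where mu charges every point."""
    return sum(p * math.log(p / q) for p, q in zip(nu, mu) if p > 0)

def gibbs_objective(nu, mu, f):
    """E_nu[f] - D(nu || mu), the quantity maximized in Lemma 4.10."""
    return sum(p * v for p, v in zip(nu, f)) - rel_entropy(nu, mu)

# Arbitrary reference measure and function (assumptions of this sketch).
mu = [0.1, 0.4, 0.3, 0.2]
f = [1.0, -0.5, 2.0, 0.3]

partition = sum(q * math.exp(v) for q, v in zip(mu, f))   # E_mu[e^f]
log_mgf = math.log(partition)
tilted = [q * math.exp(v) / partition for q, v in zip(mu, f)]

competitors = [
    mu,
    [0.25, 0.25, 0.25, 0.25],
    [0.7, 0.1, 0.1, 0.1],
    [0.0, 0.0, 1.0, 0.0],
]
values = [gibbs_objective(nu, mu, f) for nu in competitors]
print(log_mgf, gibbs_objective(tilted, mu, f), values)
```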


Remark 4.11. Note that Lemma 3.15 can be reformulated as

$$D(\nu\|\mu) = \sup\{\mathbf{E}_\nu[f] : \mathbf{E}_\mu[e^f] = 1\} = \sup\{\mathbf{E}_\nu[f] - \log\mathbf{E}_\mu[e^f]\},$$

where the sup is taken over functions $f$. Thus Lemma 4.10 is precisely the
dual convex optimization problem to the variational formula for entropy.

We can now complete the proof of Theorem 4.8.

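Proof (Theorem 4.8). By definition, the property 1 can be stated as

$$\log\mathbf{E}_\mu[e^{\lambda\{f - \mathbf{E}_\mu f\}}] \le \frac{\lambda^2\sigma^2}{2} \quad\text{for all } \lambda \in \mathbb{R},\ f \in \mathrm{Lip}(\mathbb{X}).$$

By Lemma 4.10, this is equivalent to

$$\sup_{\lambda \in \mathbb{R}}\ \sup_{f \in \mathrm{Lip}(\mathbb{X})}\ \sup_\nu\Big\{\lambda\{\mathbf{E}_\nu f - \mathbf{E}_\mu f\} - D(\nu\|\mu) - \frac{\lambda^2\sigma^2}{2}\Big\} \le 0.$$

Exchanging the order of the suprema and evaluating explicitly the suprema
over $f$ and $\lambda$ yields that the above expression is equivalent to

$$\sup_\nu\Big\{\frac{W_1(\nu,\mu)^2}{2\sigma^2} - D(\nu\|\mu)\Big\} \le 0,$$

which is evidently an immediate reformulation of property 2. □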
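For the reader's convenience, the two explicit suprema in this proof can be spelled out as follows. This is a sketch of the intermediate computation, which the notes leave implicit; it uses only that $\mathrm{Lip}(\mathbb{X})$ is symmetric (that is, $f \in \mathrm{Lip}(\mathbb{X})$ if and only if $-f \in \mathrm{Lip}(\mathbb{X})$) and that a concave quadratic in $|\lambda|$ is maximized at its critical point.

```latex
\begin{align*}
\sup_{f \in \mathrm{Lip}(\mathbb{X})} \lambda\{\mathbf{E}_\nu f - \mathbf{E}_\mu f\}
  &= |\lambda|\, W_1(\nu,\mu), \\
\sup_{\lambda \in \mathbb{R}} \Big\{ |\lambda|\, W_1(\nu,\mu) - \frac{\lambda^2\sigma^2}{2} \Big\}
  &= \frac{W_1(\nu,\mu)^2}{2\sigma^2}
  \qquad \text{(attained at } |\lambda| = W_1(\nu,\mu)/\sigma^2\text{)}.
\end{align*}
```

Substituting these two identities into the triple supremum gives the displayed condition $\sup_\nu\{W_1(\nu,\mu)^2/2\sigma^2 - D(\nu\|\mu)\} \le 0$.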

Theorem  4.8  characterizes  the  subgaussian  property  of  Lipschitz  functions 
on  an  arbitrary  but  fixed  metric  space  (X,  d).  It  is  important  to  emphasize  that 
this  is  not  in  itself  a  “high-dimensional”  result.  As  in  the  previous  chapters,  the 
crucial  idea  that  will  be  needed  to  work  in  high  dimension  is  a  tensorization 
principle.  In  the  following  section,  we  will  develop  a  different  perspective  on 
the inequality $W_1(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu\|\mu)}$ that will enable us to prove such a
tensorization principle. This will provide us with a powerful tool to develop
and understand dimension-free Lipschitz concentration inequalities.


Problems 

4.1 (Discrete log-Sobolev and Lipschitz concentration). One simple
way to gain some insight into the inequality $W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu\|\mu)}$ is to
note that it implies a sort of “dual” form of the discrete log-Sobolev inequality
$\mathrm{Ent}[e^{\lambda f}] \le \mathrm{Cov}[\lambda f, e^{\lambda f}]$ of Lemma 3.16 for Lipschitz functions.

a. Show that $W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu\|\mu)}$ implies the inequality

$$\mathrm{Cov}[\lambda f, e^{\lambda f}]^2 \le 2\lambda^2\sigma^2\,\mathrm{Ent}[e^{\lambda f}]\,\mathbf{E}[e^{\lambda f}]
  \quad\text{for } \lambda \in \mathbb{R},\ f \in \mathrm{Lip}(\mathbb{X}).$$

Hint: consider $d\nu = e^{\lambda f}\,d\mu / \mathbf{E}_\mu[e^{\lambda f}]$.

b. Use the above inequality together with the discrete log-Sobolev inequality
of Lemma 3.16 to prove that $W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu\|\mu)}$ implies that $f(X)$
is $4\sigma^2$-subgaussian for $X \sim \mu$, $f \in \mathrm{Lip}(\mathbb{X})$ (which agrees precisely with the
result of Theorem 4.8 up to the suboptimal constant 4).


4.2 (Isoperimetric inequalities and concentration). There is an entirely
different approach to investigating Lipschitz concentration properties that
played an important role in the historical development of this area: the
isoperimetric method. While we have avoided using this approach in this course, the
method remains of fundamental importance in the development and
understanding of new concentration phenomena. The goal of this problem is to
develop some basic ideas surrounding this approach.

Let $(\mathbb{X}, d)$ be a metric space. The idea behind the isoperimetric method is
not to investigate the tail behavior of functions directly, but rather to focus
attention on the probabilities of sets. For any measurable set $A \subseteq \mathbb{X}$, define
its $\varepsilon$-fattening as $A_\varepsilon := \{x \in \mathbb{X} : d(x, A) < \varepsilon\}$. A statement of the form

$$\mu(A_\varepsilon) \ge 1 - Ce^{-\varepsilon^2/2\sigma^2}
  \quad\text{for all } \varepsilon > 0,\ A \subseteq \mathbb{X} \text{ such that } \mu(A) \ge \tfrac12$$

is called an isoperimetric inequality. It states that almost every point in $\mathbb{X}$ is
$\varepsilon$-close to a set of measure $\tfrac12$. One way to interpret this result is geometrically:
given any set $A$ with $\mu(A) = \tfrac12$, the measure of its $\varepsilon$-boundary is $\mu(A_\varepsilon \setminus A) \approx \tfrac12$;
thus the boundary of the set contains almost as much mass as the interior of
the set. Mathematical phenomena relating the size of a set to the size of its
boundary are generally referred to as “isoperimetric problems.”

a. Suppose that the measure $\mu$ satisfies the above isoperimetric inequality.
Show that we have the concentration inequality

$$\mathbf{P}_\mu[f - \mathrm{med}(f) \ge t] \le Ce^{-t^2/2\sigma^2}
  \quad\text{for all } t \ge 0,\ f \in \mathrm{Lip}(\mathbb{X}).$$

Hint: consider the set $A = \{f \le \mathrm{med}(f)\}$. Here $\mathrm{med}(f)$ denotes the median.

b. Conversely, show that the above isoperimetric inequality is implied by

$$\mathbf{P}_\mu[f - \mathrm{med}(f) \ge t] \le Ce^{-t^2/2\sigma^2}
  \quad\text{for all } t \ge 0,\ f \in \mathrm{Lip}(\mathbb{X}).$$

Hint: consider $f(x) = d(x, A)$.

We  have  discovered  the  elementary  fact  that  isoperimetric  inequalities  are 
equivalent to tail bounds for Lipschitz functions. However, unlike most of our
previous results in this course, the deviation here is from the median rather
than  from  the  mean.  It  turns  out  that  deviation  inequalities  from  the  median 
and  the  mean  are  equivalent,  however,  up  to  constants.  Whether  deviation 
from  the  median  or  the  mean  is  more  useful  depends  on  the  application  (see 
Problem  3.18  for  a  situation  where  the  median  provides  useful  insight). 

c. Suppose that the above isoperimetric inequality holds. Show that

$$\mathrm{med}(f) \le \mathbf{E}_\mu f + C\sigma\sqrt{\pi/2}$$

for all $f \in \mathrm{Lip}(\mathbb{X})$, and conclude that

P „[/  -  E „/  >  t]  <  ec2^nee-t2/8<r2  for  all  t  >  0. 

Hint: estimate $\mathbf{E}_\mu[(\mathrm{med}(f) - f)_+]$ by integrating the tail bound.
d. Conversely, suppose that for $f \in \mathrm{Lip}(\mathbb{X})$

$$\mathbf{P}_\mu[f - \mathbf{E}_\mu f \ge t] \le Ce^{-t^2/2\sigma^2} \quad\text{for all } t \ge 0.$$

Show that this implies

$$\mathbf{E}_\mu f \le \mathrm{med}(f) + \sigma\sqrt{2\log 2C}$$

for all $f \in \mathrm{Lip}(\mathbb{X})$, and conclude that

P fj,[f  —  med(/)  >  t]  <  max{C,  (2C)1^4}e~t  for  all  t  >  0. 

Hint:  see  Problem  3.18. 

Finally,  we  develop  a  direct  connection  between  Theorem  4.8  and  isoperimetry. 

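e. Suppose that $W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu\|\mu)}$ for all $\nu$. Argue that

$$d(A,B) \le W_1(\mu(\,\cdot\,|A), \mu(\,\cdot\,|B))
  \le \sqrt{2\sigma^2\log(1/\mu(A))} + \sqrt{2\sigma^2\log(1/\mu(B))}$$

for any disjoint sets $A, B \subseteq \mathbb{X}$.

f. Applying the above result to $B = \mathbb{X}\setminus A_\varepsilon$, argue that

$$\mu(A_\varepsilon) \ge 1 - 2e^{-\varepsilon^2/8\sigma^2}
  \quad\text{for all } \varepsilon > 0,\ A \subseteq \mathbb{X} \text{ such that } \mu(A) \ge \tfrac12.$$

Thus $W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu\|\mu)}$ yields directly an isoperimetric inequality.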

4.2  Transportation  inequalities  and  tensorization 

In the previous section, we have introduced the fundamental inequality
$W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu\|\mu)}$ as a characterization of the Lipschitz concentration
property on a fixed metric space. However, for this result to be useful in high
dimension, we must understand whether it is possible to tensorize inequalities
of this type. It turns out that there is indeed a tensorization principle that
is extremely useful in this setting, but this is far from obvious from the
formulation developed in the previous section. In order to develop this idea, it
will prove to be necessary to formulate these inequalities in a different manner
in terms of optimal transportation. We will presently develop this connection,
and the tensorization principle that follows from it.

Optimal  transportation  is  concerned  with  the  classical  notion  of  coupling. 
Recall that a coupling of probability measures $\mu, \nu$ is any joint distribution
of random variables $(X, Y)$ with marginal distributions $X \sim \mu$ and $Y \sim \nu$. Of
course, there exist many different couplings for given $\mu, \nu$.

Definition 4.12 (Coupling). Let $\mu, \nu$ be two probability measures, and let

$$\mathcal{C}(\mu,\nu) := \{\mathrm{Law}(X, Y) : X \sim \mu,\ Y \sim \nu\}.$$

Any probability measure $\mathbf{M} \in \mathcal{C}(\mu,\nu)$ is called a coupling of $\mu, \nu$.
Let $f \in \mathrm{Lip}(\mathbb{X})$. Then for any $\mathbf{M} \in \mathcal{C}(\mu,\nu)$, we have

$$|\mathbf{E}_\mu f - \mathbf{E}_\nu f| = |\mathbf{E}_{\mathbf{M}}[f(X) - f(Y)]| \le \mathbf{E}_{\mathbf{M}}[d(X,Y)].$$

In particular, we obtain the elementary inequality

$$W_1(\mu,\nu) \le \inf_{\mathbf{M}\in\mathcal{C}(\mu,\nu)} \mathbf{E}_{\mathbf{M}}[d(X,Y)].$$

That is, the Wasserstein distance is controlled by the smallest expected
distance between random variables $X, Y$ such that $X \sim \mu$ and $Y \sim \nu$. The latter
optimization over couplings is called an optimal transportation problem. The
name derives not from viewing $\mu, \nu$ as probabilities but rather as
distributions of mass, for example, in a sandpile: the optimal transportation problem
tells us how to transform one sandpile into another sandpile in a manner that
minimizes the total distance we need to transport the grains of sand.

Remarkably,  it  turns  out  that  nothing  is  lost  in  estimating  the  Wasserstein 
distance  by  an  optimal  transportation  cost,  under  mild  technical  conditions. 
This  is  the  statement  of  the  following  classical  result. 

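Theorem 4.13 (Monge-Kantorovich duality). We have

$$W_1(\mu,\nu) = \sup_{f\in\mathrm{Lip}(\mathbb{X})} |\mathbf{E}_\mu f - \mathbf{E}_\nu f|
  = \inf_{\mathbf{M}\in\mathcal{C}(\mu,\nu)} \mathbf{E}_{\mathbf{M}}[d(X,Y)]$$

for all probability measures $\mu, \nu \in \mathcal{P}_1(\mathbb{X})$ on a separable metric space $(\mathbb{X}, d)$.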

To  avoid  getting  distracted  by  technicalities,  we  will  prove  Theorem  4.13 
here  in  the  discrete  setting.  The  full  intuition  arises  here,  and  the  extension 
to  the  continuous  case  is  an  exercise  in  approximation  (Problem  4.3). 

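Proof (Discrete case). Let $\mu, \nu$ be probabilities on the finite set $\mathbb{X} = \{1, \ldots, p\}$.
The optimal transportation problem can evidently be phrased as follows:

$$\begin{aligned}
\text{Minimize over } \mathbf{M}:\quad & \sum_{i,j=1}^p d(i,j)\,\mathbf{M}(i,j) \\
\text{Subject to:}\quad & \mathbf{M}(i,j) \ge 0, && 1 \le i,j \le p, \\
& \sum_{j=1}^p \mathbf{M}(i,j) = \mu(i), && 1 \le i \le p, \\
& \sum_{i=1}^p \mathbf{M}(i,j) = \nu(j), && 1 \le j \le p.
\end{aligned}$$

This is nothing other than a standard linear programming problem. The dual
linear programming problem corresponding to this primal problem is

$$\begin{aligned}
\text{Maximize over } f, g:\quad & \sum_{i=1}^p f(i)\mu(i) + \sum_{j=1}^p g(j)\nu(j) \\
\text{Subject to:}\quad & f(i) + g(j) \le d(i,j), \quad 1 \le i,j \le p.
\end{aligned}$$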
By the strong duality theorem of linear programming, the optimal values of
these two optimization problems coincide, so we have proved

$$\inf_{\mathbf{M}\in\mathcal{C}(\mu,\nu)} \mathbf{E}_{\mathbf{M}}[d(X,Y)]
  = \sup\{\mathbf{E}_\mu f + \mathbf{E}_\nu g : f(x) + g(y) \le d(x,y)\ \forall x,y\} =: (*).$$

We must now show that the expression $(*)$ on the right-hand side coincides
with the Wasserstein distance. Here we need to use the fact that $d$ is a metric
(so far, we only used that $d$ is a nonnegative weight function!). To this end,
note that $f, g$ satisfy $f(x) + g(y) \le d(x,y)$ for all $x, y$ if and only if

$$f(x) \le \tilde f(x) := \inf_z\{d(x,z) - g(z)\} \le -g(x) \quad\text{for all } x.$$

Moreover, $\tilde f \in \mathrm{Lip}(\mathbb{X})$ as

$$\tilde f(x) - \tilde f(y) \le \sup_z\{d(x,z) - d(y,z)\} \le d(x,y).$$

It follows immediately that

$$\mathbf{E}_\mu f + \mathbf{E}_\nu g \le \mathbf{E}_\mu \tilde f - \mathbf{E}_\nu \tilde f \le W_1(\mu,\nu)$$

whenever $f(x) + g(y) \le d(x,y)$ for all $x, y$. Thus we have shown $(*) \le W_1(\mu,\nu)$,
while $(*) \ge W_1(\mu,\nu)$ holds trivially (restrict the supremum to $g = -f$). □
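To see the duality at work in a concrete case, consider points on a line with $d(i,j) = |i-j|$. The sketch below is an illustration only and is not taken from the proof above: it exhibits a feasible coupling (the greedy left-to-right coupling) and a feasible 1-Lipschitz function (built from the signs of the difference of the cumulative distribution functions) whose values agree. Since any coupling cost bounds any dual value from above, equal values certify that both are optimal. The function names and the example measures are hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction as Fr

def monotone_coupling(mu, nu):
    """Primal candidate: ship mass greedily from left to right."""
    p = len(mu)
    M = [[Fr(0)] * p for _ in range(p)]
    a, b = list(mu), list(nu)      # mass still to be shipped / received
    i = j = 0
    while i < p and j < p:
        m = min(a[i], b[j])
        M[i][j] += m
        a[i] -= m
        b[j] -= m
        if a[i] <= b[j]:           # source i is exhausted
            i += 1
        else:                      # sink j is full
            j += 1
    return M

def transport_cost(M):
    """Expected distance E_M[|X - Y|] under the coupling M."""
    p = len(M)
    return sum(abs(i - j) * M[i][j] for i in range(p) for j in range(p))

def dual_potential(mu, nu):
    """Dual candidate: f(i) - f(i+1) = sign(F(i) - G(i)), so f is 1-Lipschitz."""
    p = len(mu)
    f = [Fr(0)] * p
    F = G = Fr(0)
    for i in range(p - 1):
        F += mu[i]
        G += nu[i]
        f[i + 1] = f[i] - ((F > G) - (F < G))
    return f

def dual_value(f, mu, nu):
    """E_mu[f] - E_nu[f]."""
    return sum(fi * (m - n) for fi, m, n in zip(f, mu, nu))

# Hypothetical example on the four points 0, 1, 2, 3.
mu = [Fr(1, 2), Fr(1, 4), Fr(1, 4), Fr(0)]
nu = [Fr(0), Fr(1, 4), Fr(1, 4), Fr(1, 2)]
M = monotone_coupling(mu, nu)
f = dual_potential(mu, nu)
primal, dual = transport_cost(M), dual_value(f, mu, nu)
print(primal == dual)
```

On a general finite metric space no such explicit formulas are available, and one would solve the linear program directly.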

The  separability  assumption  of  Theorem  4.13  is  not  entirely  innocuous. 
For example, the trivial metric $d(x,y) = \mathbf{1}_{x\ne y}$ considered in Example 4.9
is  not  separable  (unless  X  is  discrete),  yet  Monge-Kantorovich  duality  still 
holds  in  this  case.  As  this  is  both  an  important  example  and  an  interesting 
illustration,  let  us  provide  here  a  direct  proof  of  Monge-Kantorovich  duality 
for  the  trivial  metric.  It  is  in  fact  possible  to  obtain  a  more  general  version  of 
Theorem  4.13  that  contains  both  separable  metrics  and  the  trivial  metric  as 
special  cases,  but  this  will  not  be  needed  for  our  purposes. 

Example 4.14 (Total variation). Let $d(x,y) = \mathbf{1}_{x\ne y}$ be the trivial metric. We
have seen in Example 4.9 that in this case the Wasserstein distance coincides
with the total variation distance, so that Monge-Kantorovich duality reads

$$\|\mu - \nu\|_{\mathrm{TV}} = \inf_{\mathbf{M}\in\mathcal{C}(\mu,\nu)} \mathbf{M}[X \ne Y].$$

That is, the total variation distance between $\mu, \nu$ is the minimal probability
that random variables $X \sim \mu$ and $Y \sim \nu$ do not coincide. We will presently
give a direct proof of this fundamental result. As

$$\|\mu - \nu\|_{\mathrm{TV}} = \sup_{0\le f\le 1} |\mathbf{E}_{\mathbf{M}}[f(X) - f(Y)]| \le \mathbf{M}[X \ne Y]$$

holds trivially for every $\mathbf{M} \in \mathcal{C}(\mu,\nu)$, it suffices to construct an optimal
coupling that attains equality (in contrast, Theorem 4.13 is not constructive).
To construct an optimal coupling, let us assume that we can write $d\mu =
f\,d\rho$ and $d\nu = g\,d\rho$ for some reference measure $\rho$ and densities $f, g$ (this entails
no loss of generality, as we can always choose $\rho = \mu + \nu$). The idea is now to
decompose $\mu$ and $\nu$ into a “common part” and “disjoint parts.” We can then
construct a coupling by letting either $X = Y$ be drawn from the common
part, or drawing $X$ and $Y$ independently from the disjoint parts, with the
probabilities chosen appropriately so that this is a coupling. To be precise, let
us define the “common part” $\eta$ and the “disjoint parts” $\tilde\mu, \tilde\nu$ as

$$d\eta := \{f \wedge g\}\,d\rho, \qquad d\tilde\mu := \{f - f \wedge g\}\,d\rho, \qquad d\tilde\nu := \{g - f \wedge g\}\,d\rho.$$

Then $\eta, \tilde\mu, \tilde\nu$ are all positive measures, $\mu = \eta + \tilde\mu$, $\nu = \eta + \tilde\nu$, and $\tilde\mu, \tilde\nu$ have
disjoint supports. This construction is illustrated in the following figure:


We now define the probability measure $\mathbf{M}$ as

$$\mathbf{M}(dx, dy) = \eta(dx)\,\delta_x(dy) + \frac{\tilde\mu(dx)\,\tilde\nu(dy)}{1 - \eta(\mathbb{X})}$$

(here $\delta_x$ denotes the point mass at $x$). It is readily verified that $\mathbf{M} \in \mathcal{C}(\mu,\nu)$
by construction. Moreover, as $\tilde\mu, \tilde\nu$ have disjoint supports, we have

$$\mathbf{M}[X \ne Y] = 1 - \eta(\mathbb{X}) = \int \{f - f \wedge g\}\,d\rho.$$

But note that

$$\int \{f - f \wedge g\}\,d\rho = \int (f - g)_+\,d\rho = \sup_{0\le h\le 1} \int h\{f - g\}\,d\rho = \|\mu - \nu\|_{\mathrm{TV}}.$$

Thus we have constructed an optimal coupling that attains the infimum in
the Monge-Kantorovich duality formula for total variation distance.
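On a finite set the construction of the optimal coupling can be carried out verbatim. The following minimal sketch is our own illustration: the function names and the three-point example are hypothetical, and the counting measure plays the role of the reference measure, so that densities are simply probability vectors. It builds the coupling from the common part and the disjoint parts and checks that it has the right marginals and that the probability of $X \ne Y$ equals the total variation distance.

```python
from fractions import Fraction as Fr

def tv_optimal_coupling(mu, nu):
    """Coupling M = (common part on the diagonal) + (product of disjoint parts)."""
    p = len(mu)
    common = [min(a, b) for a, b in zip(mu, nu)]      # density f ^ g
    mu_rest = [a - c for a, c in zip(mu, common)]     # density f - f ^ g
    nu_rest = [b - c for b, c in zip(nu, common)]     # density g - f ^ g
    rest = 1 - sum(common)                            # mass of each disjoint part
    M = [[Fr(0)] * p for _ in range(p)]
    for x in range(p):
        M[x][x] += common[x]
        if rest > 0:
            for y in range(p):
                M[x][y] += mu_rest[x] * nu_rest[y] / rest
    return M

def total_variation(mu, nu):
    """||mu - nu||_TV computed as the integral of (f - g)_+."""
    return sum(max(a - b, 0) for a, b in zip(mu, nu))

# Hypothetical example on a three-point space, in exact arithmetic.
mu = [Fr(1, 2), Fr(1, 3), Fr(1, 6)]
nu = [Fr(1, 4), Fr(1, 4), Fr(1, 2)]
M = tv_optimal_coupling(mu, nu)

row_sums = [sum(row) for row in M]                               # law of X
col_sums = [sum(M[x][y] for x in range(3)) for y in range(3)]    # law of Y
prob_differ = 1 - sum(M[x][x] for x in range(3))                 # M[X != Y]
print(row_sums == mu, col_sums == nu, prob_differ == total_variation(mu, nu))
```

The off-diagonal product term never contributes to the diagonal, because the two disjoint parts cannot both charge the same point.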

We now conclude our detour through the optimal transportation
problem and return to the investigation of concentration. By virtue of
Monge-Kantorovich duality, it evidently follows from Theorem 4.8 that $f(X)$ is
$\sigma^2$-subgaussian for every $f \in \mathrm{Lip}(\mathbb{X})$ and $X \sim \mu$ if and only if

$$W_1(\mu,\nu) = \inf_{\mathbf{M}\in\mathcal{C}(\mu,\nu)} \mathbf{E}_{\mathbf{M}}[d(X,Y)] \le \sqrt{2\sigma^2 D(\nu\|\mu)} \quad\text{for all } \nu.$$

Inequalities  of  this  type  are  called  transportation  cost  inequalities.  While  we 
have  previously  formulated  them  without  any  reference  to  transportation,  it 
turns  out  that  the  formulation  in  terms  of  optimal  transportation  is  crucial 
in  order  to  develop  a  suitable  tensorization  principle.  This  is  our  next  goal. 

How  might  we  expect  Lipschitz  concentration  to  tensorize?  It  is  not  even 
entirely clear what is meant. Let $\mu_i$ be a probability measure on $(\mathbb{X}_i, d_i)$ for
$i = 1, \ldots, n$, such that each $\mu_i$ satisfies the transportation cost inequality

$$W_1(\nu, \mu_i) \le \sqrt{2\sigma^2 D(\nu\|\mu_i)} \quad\text{for all } \nu.$$

We would like to deduce that the product measure $\mu_1 \otimes \cdots \otimes \mu_n$ on $\mathbb{X}_1 \times \cdots \times \mathbb{X}_n$
satisfies a Lipschitz concentration property, that is, that

$$W_1(\nu, \mu_1 \otimes \cdots \otimes \mu_n) \le \sqrt{2\sigma^2 D(\nu\|\mu_1 \otimes \cdots \otimes \mu_n)} \quad\text{for all } \nu.$$

However, to even make sense of this statement, we must first specify a
metric $d$ on $\mathbb{X}_1 \times \cdots \times \mathbb{X}_n$. For example, one might be interested in working
with the $\ell_1$-metric $d(x,y) = d_1(x_1,y_1) + \cdots + d_n(x_n,y_n)$, or with the $\ell_2$-metric
$d(x,y) = \{d_1(x_1,y_1)^2 + \cdots + d_n(x_n,y_n)^2\}^{1/2}$, or with any other suitable
combination. Ultimately, however, the appropriate choice of metric will be dictated
by whether we are able to prove a tensorization principle. As will become clear
in the sequel, we can prove different forms of tensorization in product spaces
(i.e., for different definitions of the metric $d$) by using different types of
transportation cost inequalities. It is therefore fruitful, rather than considering one
specific  setting,  to  prove  a  tensorization  principle  for  a  rather  general  class 
of  transportation  cost  inequalities.  The  following  theorem  does  precisely  that. 
Once  its  power  has  been  understood,  it  will  be  straightforward  to  interpret 
the  behavior  of  different  transportation  cost  inequalities  in  high  dimension. 

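Theorem 4.15 (Marton). Let $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ be a convex function, and let
$w_i : \mathbb{X}_i \times \mathbb{X}_i \to \mathbb{R}_+$ be a positive weight function. Suppose that for $i = 1, \ldots, n$

$$\inf_{\mathbf{M}\in\mathcal{C}(\mu_i,\nu)} \varphi(\mathbf{E}_{\mathbf{M}}[w_i(X,Y)]) \le 2\sigma^2 D(\nu\|\mu_i) \quad\text{for all } \nu.$$

Then we have

$$\inf_{\mathbf{M}\in\mathcal{C}(\mu_1\otimes\cdots\otimes\mu_n,\nu)} \sum_{i=1}^n \varphi(\mathbf{E}_{\mathbf{M}}[w_i(X_i,Y_i)])
  \le 2\sigma^2 D(\nu\|\mu_1 \otimes \cdots \otimes \mu_n) \quad\text{for all } \nu.$$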


The transportation cost inequality $W_1(\mu_i,\nu) \le \sqrt{2\sigma^2 D(\nu\|\mu_i)}$ corresponds
to the assumption of Theorem 4.15 with $\varphi(x) = x^2$ and $w_i(x,y) = d_i(x,y)$.
However, the quantity on the left-hand side of the “tensorized” inequality

$$\inf_{\mathbf{M}\in\mathcal{C}(\mu_1\otimes\cdots\otimes\mu_n,\nu)} \Big[\sum_{i=1}^n \mathbf{E}_{\mathbf{M}}[d_i(X_i,Y_i)]^2\Big]^{1/2}
  \le \sqrt{2\sigma^2 D(\nu\|\mu_1 \otimes \cdots \otimes \mu_n)}$$

is not itself a Wasserstein distance. We must therefore take an extra step to
use this general tensorization principle. For example, if we define

$$d_c(x,y) := \sum_{i=1}^n c_i\, d_i(x_i,y_i),$$

the weighted $\ell_1$-metric on $\mathbb{X}_1 \times \cdots \times \mathbb{X}_n$, we obtain the following.

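Corollary 4.16. Suppose that the transportation cost inequality

$$W_1(\mu_i,\nu) \le \sqrt{2\sigma^2 D(\nu\|\mu_i)} \quad\text{for all } \nu$$

holds for $\mu_i$ on $(\mathbb{X}_i, d_i)$ for $i = 1, \ldots, n$. Then the transportation cost inequality

$$W_1(\mu_1 \otimes \cdots \otimes \mu_n, \nu) \le \sqrt{2\sigma^2 D(\nu\|\mu_1 \otimes \cdots \otimes \mu_n)} \quad\text{for all } \nu$$

holds for $\mu_1 \otimes \cdots \otimes \mu_n$ on $(\mathbb{X}_1 \times \cdots \times \mathbb{X}_n, d_c)$ whenever $\sum_{i=1}^n c_i^2 = 1$.

Proof. For probability measures $\nu, \mu$ on $(\mathbb{X}_1 \times \cdots \times \mathbb{X}_n, d_c)$, we have

$$W_1(\nu,\mu) = \inf_{\mathbf{M}\in\mathcal{C}(\mu,\nu)} \sum_{i=1}^n c_i\,\mathbf{E}_{\mathbf{M}}[d_i(X_i,Y_i)]
  \le \inf_{\mathbf{M}\in\mathcal{C}(\mu,\nu)} \Big[\sum_{i=1}^n \mathbf{E}_{\mathbf{M}}[d_i(X_i,Y_i)]^2\Big]^{1/2}$$

by the Cauchy-Schwarz inequality (as $\sum_{i=1}^n c_i^2 = 1$). The result now follows
from Theorem 4.15 with $\varphi(x) = x^2$ and $w_i(x,y) = d_i(x,y)$. □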

Corollary  4.16  yields  immediately  another  proof  of  McDiarmid’s  inequality. 

Example 4.17 (McDiarmid’s inequality). The trivial metric $d_i(x,y) = \mathbf{1}_{x\ne y}$
on $\mathbb{X}_i$ satisfies the transportation cost inequality $W_1(\mu,\nu) \le \{\tfrac12 D(\nu\|\mu)\}^{1/2}$
by Pinsker’s inequality (Example 4.9). Therefore, by Corollary 4.16, we have

$$W_1(\mu_1 \otimes \cdots \otimes \mu_n, \nu) \le \sqrt{\tfrac12 D(\nu\|\mu_1 \otimes \cdots \otimes \mu_n)}$$

on $\mathbb{X}_1 \times \cdots \times \mathbb{X}_n$ with respect to the weighted Hamming distance $d_c(x,y) =
\sum_{i=1}^n c_i \mathbf{1}_{x_i\ne y_i}$. Thus Theorem 4.8 yields precisely the Lipschitz formulation
of McDiarmid’s inequality discussed in Example 4.4.
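As a sanity check, the resulting tail bound can be compared with an exact computation in the simplest case. The sketch below is an illustration under stated assumptions, not part of the original text: we take $X_1, \ldots, X_n$ to be independent fair coin flips and $f(x) = x_1 + \cdots + x_n$, which changes by at most $c_i = 1$ when one coordinate is changed, so that McDiarmid's inequality reads $\mathbf{P}[f - \mathbf{E}f \ge t] \le e^{-2t^2/n}$. The function names are hypothetical.

```python
import math

def binomial_upper_tail(n, t):
    """Exact P[S - n/2 >= t] for S ~ Binomial(n, 1/2)."""
    k_min = math.ceil(n / 2 + t)
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

def mcdiarmid_bound(n, t):
    """Subgaussian tail bound exp(-2 t^2 / sum_i c_i^2) with c_i = 1."""
    return math.exp(-2 * t * t / n)

n = 30
checks = [(t, binomial_upper_tail(n, t), mcdiarmid_bound(n, t))
          for t in range(0, n // 2 + 1)]
bound_holds = all(tail <= bound for _, tail, bound in checks)
print(bound_holds)
```

The comparison also shows how much is lost: for large deviations the exact binomial tail is considerably smaller than the bound.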

By using the Cauchy-Schwarz inequality as in Corollary 4.16, the tensorization
principle of Theorem 4.15 yields dimension-free concentration inequalities
in terms of weighted $\ell_1$-metrics. In the next section, we will use a more refined
version  of  the  argument  that  led  to  the  transportation  proof  of  McDiarmid’s 
inequality  to  prove  Talagrand’s  concentration  inequality,  which  is  a  crucial 
improvement  over  McDiarmid’s  inequality  in  terms  of  “one-sided  differences” 
that  makes  it  possible  to  obtain  lower  tail  bounds  in  many  situations  where 
a  direct  application  of  the  log-Sobolev  machinery  fails. 


On the other hand, Corollary 4.16 does not capture dimension-free
concentration with respect to $\ell_2$-metrics, such as we have seen in the case of Gaussian
concentration. It turns out that not every probability measure $\mu$ tensorizes
in an $\ell_2$-fashion. Nonetheless, by using Theorem 4.15 in a different manner,
we will be able to completely characterize measures $\mu$ for which this is the
case using transportation cost inequalities. This will be discussed in detail in
Section 4.4 below, and we postpone further discussion until then.

The  remainder  of  this  section  is  devoted  to  the  proof  of  Theorem  4.15.  The 
first  step  in  the  proof  will  be  based  on  the  following  elementary  property. 

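Lemma 4.18 (Chain rule for relative entropy). Let $\mathbf{M}, \mathbf{N}$ be probability
measures that define the joint distribution of random variables $X, Y$. Then

$$D(\mathbf{M}\{X,Y \in \cdot\}\,\|\,\mathbf{N}\{X,Y \in \cdot\}) =
  D(\mathbf{M}\{X \in \cdot\}\,\|\,\mathbf{N}\{X \in \cdot\})
  + \mathbf{E}_{\mathbf{M}}[D(\mathbf{M}\{Y \in \cdot\,|X\}\,\|\,\mathbf{N}\{Y \in \cdot\,|X\})].$$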


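Proof. It is readily verified for $\mathbf{M} \ll \mathbf{N}$ that

$$\frac{d\mathbf{M}\{X,Y \in \cdot\}}{d\mathbf{N}\{X,Y \in \cdot\}}
  = \frac{d\mathbf{M}\{X \in \cdot\}}{d\mathbf{N}\{X \in \cdot\}}\,
    \frac{d\mathbf{M}\{Y \in \cdot\,|X\}}{d\mathbf{N}\{Y \in \cdot\,|X\}}$$

by definition of the Radon-Nikodym density (this is the Bayes formula). Thus

$$D(\mathbf{M}\{X,Y \in \cdot\}\,\|\,\mathbf{N}\{X,Y \in \cdot\}) =
  \mathbf{E}_{\mathbf{M}}\Big[\log\frac{d\mathbf{M}\{X \in \cdot\}}{d\mathbf{N}\{X \in \cdot\}}\Big]
  + \mathbf{E}_{\mathbf{M}}\Big[\mathbf{E}_{\mathbf{M}}\Big[\log\frac{d\mathbf{M}\{Y \in \cdot\,|X\}}{d\mathbf{N}\{Y \in \cdot\,|X\}}\,\Big|\,X\Big]\Big],$$

and the conclusion follows from the definition of relative entropy. □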
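For distributions on a finite product space, the chain rule can be confirmed by direct computation. The following minimal sketch uses hypothetical joint probability tables `M` and `N` on a two-by-two space (with `N` strictly positive so that all densities exist); the helper names are our own.

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) between finite probability vectors."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def chain_rule_sides(M, N):
    """Return (joint entropy, marginal entropy + expected conditional entropy).

    M[x][y] and N[x][y] are joint probabilities of (X, Y); N must be positive.
    """
    lhs = kl([m for row in M for m in row], [n for row in N for n in row])
    Mx = [sum(row) for row in M]   # law of X under M
    Nx = [sum(row) for row in N]   # law of X under N
    rhs = kl(Mx, Nx)
    for x, (mrow, nrow) in enumerate(zip(M, N)):
        if Mx[x] > 0:
            cond_m = [m / Mx[x] for m in mrow]   # M{Y in . | X = x}
            cond_n = [n / Nx[x] for n in nrow]   # N{Y in . | X = x}
            rhs += Mx[x] * kl(cond_m, cond_n)
    return lhs, rhs

# Hypothetical example: a dependent joint law M against a product law N.
M = [[0.1, 0.2], [0.3, 0.4]]
N = [[0.25, 0.25], [0.25, 0.25]]
lhs, rhs = chain_rule_sides(M, N)
print(abs(lhs - rhs) < 1e-12)
```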


We  now  complete  the  proof  of  Theorem  4.15. 

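Proof (Theorem 4.15). The case $n = 1$ is trivial as the conclusion coincides
with the assumption. We will proceed with the proof by induction on $n$. That
is, let us suppose that the result has been proved for the case $n = k$. We
presently show that this implies the result holds also for the case $n = k + 1$.

Fix for the time being a probability measure $\nu$ on $\mathbb{X}_1 \times \cdots \times \mathbb{X}_{k+1}$. Let
$\nu^{(k)}$ be the marginal of $\nu$ on $\mathbb{X}_1 \times \cdots \times \mathbb{X}_k$, and let $\nu_{x_1,\ldots,x_k}$ be a version of
the regular conditional probability $\mathbf{P}[X_{k+1} \in \cdot\,|X_1, \ldots, X_k]$. Then

$$D(\nu\|\mu_1 \otimes \cdots \otimes \mu_{k+1}) = D(\nu^{(k)}\|\mu_1 \otimes \cdots \otimes \mu_k)
  + \mathbf{E}_\nu[D(\nu_{X_1,\ldots,X_k}\|\mu_{k+1})]$$

by the chain rule for relative entropy. We can now apply the induction
hypothesis to the first term on the right and the assumption of the Theorem to
the second term on the right. In particular, by the induction hypothesis

$$2\sigma^2 D(\nu^{(k)}\|\mu_1 \otimes \cdots \otimes \mu_k) \ge
  \inf_{\mathbf{M}\in\mathcal{C}(\mu_1\otimes\cdots\otimes\mu_k,\,\nu^{(k)})} \sum_{i=1}^k \varphi(\mathbf{E}_{\mathbf{M}}[w_i(X_i,Y_i)]),$$

while by the assumption of the Theorem

$$2\sigma^2 D(\nu_{y_1,\ldots,y_k}\|\mu_{k+1}) \ge
  \inf_{\mathbf{M}\in\mathcal{C}(\mu_{k+1},\,\nu_{y_1,\ldots,y_k})} \varphi(\mathbf{E}_{\mathbf{M}}[w_{k+1}(X,Y)]).$$

Fix $\varepsilon > 0$, and choose an $\varepsilon$-minimizer $\mathbf{M}^{(k)} \in \mathcal{C}(\mu_1 \otimes \cdots \otimes \mu_k, \nu^{(k)})$ in the
first inequality and an $\varepsilon$-minimizer $\mathbf{M}_{y_1,\ldots,y_k} \in \mathcal{C}(\mu_{k+1}, \nu_{y_1,\ldots,y_k})$ in the second
inequality for every choice of $y_1, \ldots, y_k$. Then we have shown that

$$2\sigma^2 D(\nu\|\mu_1 \otimes \cdots \otimes \mu_{k+1}) \ge
  \sum_{i=1}^k \varphi(\mathbf{E}_{\mathbf{M}^{(k)}}[w_i(X_i,Y_i)])
  + \varphi\big(\mathbf{E}_{\mathbf{M}^{(k)}}\big[\mathbf{E}_{\mathbf{M}_{Y_1,\ldots,Y_k}}[w_{k+1}(X_{k+1},Y_{k+1})]\big]\big) - 2\varepsilon,$$

where we have used convexity of $\varphi$ and that $(Y_1, \ldots, Y_k) \sim \nu^{(k)}$ under $\mathbf{M}^{(k)}$.

We now construct a coupling $\mathbf{M} \in \mathcal{C}(\mu_1 \otimes \cdots \otimes \mu_{k+1}, \nu)$ by sticking together
the couplings $\mathbf{M}^{(k)}$ and $\mathbf{M}_{y_1,\ldots,y_k}$. To be precise, define $\mathbf{M}$ such that

$$\mathbf{M}[X_1, \ldots, X_k, Y_1, \ldots, Y_k \in \cdot\,] = \mathbf{M}^{(k)},$$

$$\mathbf{M}[X_{k+1}, Y_{k+1} \in \cdot\,|X_1, \ldots, X_k, Y_1, \ldots, Y_k] = \mathbf{M}_{Y_1,\ldots,Y_k}.$$

It is readily verified that $\mathbf{M} \in \mathcal{C}(\mu_1 \otimes \cdots \otimes \mu_{k+1}, \nu)$, so by the above inequality

$$2\sigma^2 D(\nu\|\mu_1 \otimes \cdots \otimes \mu_{k+1}) \ge
  \inf_{\mathbf{M}\in\mathcal{C}(\mu_1\otimes\cdots\otimes\mu_{k+1},\,\nu)} \sum_{i=1}^{k+1} \varphi(\mathbf{E}_{\mathbf{M}}[w_i(X_i,Y_i)]) - 2\varepsilon.$$

As $\varepsilon > 0$ and $\nu$ were arbitrary, the proof for the case $n = k + 1$ is complete. □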

Remark  4.19.  There  is  a  minor  technical  issue  that  we  have  ignored  in  the 
above proof. We selected an $\varepsilon$-minimizer $\mathbf{M}_{y_1,\ldots,y_k}$ independently for every
choice of $y_1, \ldots, y_k$, but in order for the remaining computations to make
sense we must ensure that $\mathbf{M}_{y_1,\ldots,y_k}$ depends on $y_1, \ldots, y_k$ in a measurable
fashion.  However,  this  purely  technical  issue  can  be  resolved  using  standard 
measurable  selection  arguments  in  any  standard  Borel  space. 

Problems 

4.3  (Monge-Kantorovich  duality:  continuous  case).  We  have  stated 
Theorem  4.13  in  the  setting  where  (X,  d)  is  a  separable  metric  space.  However, 
we  only  provided  a  proof  for  the  case  where  X  is  a  finite  set.  The  goal  of  this 
problem  is  to  work  through  the  approximations  needed  to  deduce  the  general 
result  from  the  discrete  case.  To  avoid  confusion,  define 

$$T_1(\mu,\nu) := \inf_{\mathbf{M}\in\mathcal{C}(\mu,\nu)} \mathbf{E}_{\mathbf{M}}[d(X,Y)].$$

Our aim is to show that $T_1(\mu,\nu) = W_1(\mu,\nu)$.

a. Prove that $T_1$ is a metric on $\mathcal{P}_1(\mathbb{X})$.

Hint: to prove $T_1(\mu,\nu) \le T_1(\mu,\rho) + T_1(\nu,\rho)$, choose $\varepsilon$-optimal couplings
$\mathbf{M}_1, \mathbf{M}_2$ in the definitions of $T_1(\mu,\rho)$, $T_1(\nu,\rho)$ and consider $\mathbf{M}[X, Y, Z \in \cdot\,]$
defined by $\mathbf{M}[X, Y \in \cdot\,] = \mathbf{M}_1$ and $\mathbf{M}[Z \in \cdot\,|X, Y] = \mathbf{M}_2[X \in \cdot\,|Y]$.

b. For every $k \in \mathbb{N}$, construct disjoint sets $B_n^k \subseteq \mathbb{X}$ as follows:

$$B_1^k = \{x \in \mathbb{X} : d(x, x_1) < 2^{-k}\}, \qquad
  B_n^k = \{x \in \mathbb{X} : d(x, x_n) < 2^{-k}\} \setminus \bigcup_{i=1}^{n-1} B_i^k,$$

where $\{x_n : n \in \mathbb{N}\}$ is a countable dense subset of $\mathbb{X}$. Choose an arbitrary
point $y_n^k \in B_n^k$ for every $n, k$. For any $\mu \in \mathcal{P}_1(\mathbb{X})$, we now define

$$\mu^k := \sum_{n=1}^\infty \mu(B_n^k)\,\delta_{y_n^k}.$$

Show that we have $W_1(\mu^k,\mu) \le T_1(\mu^k,\mu) \le 2^{-k}$ for all $k \in \mathbb{N}$.

c. Show that the above construction can be modified such that $\mu^k$ has finite
(rather than countable) support for all $k \in \mathbb{N}$, and $T_1(\mu^k,\mu) \to 0$ as $k \to \infty$.

d. Conclude using the already proved discrete case of Theorem 4.13 that the
conclusion extends to the case where $(\mathbb{X}, d)$ is any separable metric space.

4.4 (Monge-Kantorovich duality on $\mathbb{R}$). In many cases, explicit
computation of the Wasserstein distance is impossible. However, there is an explicit
expression for the Wasserstein distance on the real line $(\mathbb{R}, |\cdot|)$:

$$W_1(\mu,\nu) = \int_{-\infty}^{\infty} |F(t) - G(t)|\,dt,$$

where $F(t) = \mathbf{P}_\mu[X \le t]$ and $G(t) = \mathbf{P}_\nu[X \le t]$ denote the cumulative
distribution functions of $\mu$ and $\nu$, respectively.

a. Show that for smooth functions $f$ with compact support

$$\int f\,d\mu = -\int_{-\infty}^{\infty} f'(t)F(t)\,dt.$$

b. Use the previous part to prove the explicit expression for $W_1(\mu,\nu)$.

c. By Monge-Kantorovich duality, we obtain

$$\inf_{\mathbf{M}\in\mathcal{C}(\mu,\nu)} \mathbf{E}_{\mathbf{M}}[|X - Y|] = \int_{-\infty}^{\infty} |F(t) - G(t)|\,dt.$$

Find an explicit construction for the optimal coupling $\mathbf{M}$.
Hint: let $U \sim \mathrm{Uniform}[0,1]$. Then $F^{-1}(U) \sim \mu$ and $G^{-1}(U) \sim \nu$.
4.5 (Concentration for Markov chains). The transportation method can
be useful for obtaining concentration results for dependent random variables.
The goal of this problem is to develop the simplest possible example of this
kind. Let $X_1, \ldots, X_n$ be a Markov chain with transition kernels

$$\mathbf{P}[X_k \in A\,|X_1, \ldots, X_{k-1}] = Q_k(X_{k-1}, A).$$

We will assume that the chain satisfies the Doeblin condition

$$\|Q_k(x, \cdot) - Q_k(x', \cdot)\|_{\mathrm{TV}} \le 1 - \alpha \quad\text{for all } x, x'$$

for some $\alpha > 0$. Even though $X_1, \ldots, X_n$ are not independent (we denote their
joint distribution as $\mu$), we can still obtain a transportation cost inequality
by adapting the proof of the tensorization principle of Theorem 4.15.

a. Let $\rho_1, \rho_2, \rho_3$ be probability distributions on the same space. Show that
there exists a joint distribution $\mathbf{M}$ of random variables $X, Y, Z$ such that

$$\mathbf{M}[X \in \cdot\,] = \rho_1, \qquad \mathbf{M}[Y \in \cdot\,] = \rho_2, \qquad \mathbf{M}[Z \in \cdot\,] = \rho_3,$$

and such that

$$\mathbf{M}[X \ne Y] = \|\rho_1 - \rho_2\|_{\mathrm{TV}}, \qquad \mathbf{M}[Y \ne Z] = \|\rho_2 - \rho_3\|_{\mathrm{TV}}.$$

Hint: this is similar to part a. of Problem 4.3.

b.  Let  v  be  any  distribution  of  random  variables  Y\,...,Yn.  Construct  the 
probability  measure  M  such  that  Zk  =  (Xk.  Xk,  Yk),  k  <n  satisfy 


M[Xfc  G  •  \ZX, . . 

•  5  Xk— i]  —  Qk{^Xk. 

M[Xk  G  •  |Z;l,  . . 

•  5  Xk— i]  —  Qkfik- 

M [Yk  G  •  \Zi, . . 

■  i  Zk-i]  =  v{Yk  G 

and 

M[Xk^Xk\ Zu...,Zk_!] 

=  \\Qk(xk-i,  ■)  — 

M[Xk^Yk\Z1,...,Zk_1] 

=  \\Qk{Yk-i,  ■)  -  > 

Show  that 

M[Xk^Yk\Z1,...,Zk_1] 

<  y/±D(v(Yke-\Yu. 

■ . ,  Yfc-i)||Qfc(Yfc-i 

c.  Now  adapt  the  proof  of  Theorem  4.15  to  show  that 


n  - 

* y*i '  for ai1 

v  k=  1 

and deduce an extension of McDiarmid’s inequality in the present setting
(in the case of equal weights). The independent case is recovered if $\alpha = 1$.
4.3 Talagrand’s concentration inequality

Up to this point, the metric perspective and the transportation method did
not yield any new results beyond a complementary point of view on the
concentration phenomenon. In the present section, however, we will see that the
metric approach to concentration allows us to prove new concentration results
that were not accessible by the methods we have developed so far.

Let $X_1, \ldots, X_n$ be independent. To understand the issue at hand, let us
once more consider McDiarmid’s inequality. One way to phrase it is as follows:

$$\|D_i f\|_\infty \le c_i \text{ for } 1 \le i \le n \quad\Longrightarrow\quad
  \mathbf{P}[f(X_1, \ldots, X_n) - \mathbf{E}f(X_1, \ldots, X_n) \ge t] \le e^{-2t^2/\sum_{i=1}^n c_i^2} \text{ for } t \ge 0.$$

We  proved  this  result  in  three  different  ways:  using  the  martingale  method,  the 
transportation  method,  and  the  entropy  method.  The  latter  method,  however, 
was  able  to  produce  much  stronger  results  in  terms  of  one-sided  differences. 
For  example,  we  obtained  in  Theorem  3.18  the  one-sided  bound 

$$D_i^- f(x) \le c_i(x) \text{ for } 1 \le i \le n \quad\Longrightarrow\quad
  \mathbf{P}[f(X_1, \ldots, X_n) - \mathbf{E}f(X_1, \ldots, X_n) \ge t] \le e^{-t^2/4\|\sum_{i=1}^n c_i^2\|_\infty} \text{ for } t \ge 0.$$

This is often a crucial improvement over McDiarmid’s inequality.
Unfortunately, while McDiarmid’s inequality is a subgaussian inequality (it gives both
an upper and a lower tail bound by applying the bound to $f$ and $-f$), the
one-sided result obtained by the entropy method can only give an upper tail
bound and not a lower tail bound in terms of the one-sided differences $D_i^- f$
(as $D_i^-(-f) \ne -D_i^- f$). There are many situations in which one can control
$D_i^- f$ only (cf. Example 3.19), and we have not yet developed any tool that
can yield the subgaussian property in such cases.

The aim of this section is to investigate the one-sided difference inequality
from the perspective of Lipschitz concentration. What type of Lipschitz
property does the one-sided bound correspond to? For McDiarmid’s inequality, the
property $\|D_i f\|_\infty \le c_i$ for all $i$ is equivalent to the Lipschitz property

$$f(x) - f(y) \le \sum_{i=1}^n c_i \mathbf{1}_{x_i\ne y_i} \quad\text{for all } x, y.$$

If we relax the assumption to $D_i^- f(x) \le c_i(x)$ for all $i, x$, it is therefore natural
to consider the analogous “one-sided Lipschitz property”

$$f(x) - f(y) \le \sum_{i=1}^n c_i(x) \mathbf{1}_{x_i\ne y_i} \quad\text{for all } x, y.$$

It is easily seen that the latter property does indeed imply $D_i^- f(x) \le c_i(x)$.
However, the converse is not true: the one-sided Lipschitz property is strictly
stronger than control on the one-sided gradient. While the two assumptions
can often be verified in the same manner in applications, the one-sided
gradient bound is not naturally expressed as a Lipschitz property, while the
one-sided Lipschitz property is not naturally expressed as a gradient.

We have thus arrived at a fork in the road where the perspective of the
present chapter diverges from the perspective developed in the previous
chapters. To exploit the one-sided Lipschitz property, we will use the transportation
method to derive an important concentration inequality due to Talagrand. The
remarkable aspect of this result is that it yields the full subgaussian property
(i.e., an upper and lower tail bound) even though only a one-sided assumption
was imposed. This makes it possible to obtain lower tails in many examples
that were out of reach of the theory developed in the previous chapter.

Theorem 4.20 (Talagrand). Let $X_1, \ldots, X_n$ be independent, and suppose

$$f(x) - f(y) \le \sum_{i=1}^n c_i(x) \mathbf{1}_{x_i\ne y_i} \quad\text{for all } x, y.$$

Then $f(X_1, \ldots, X_n)$ is $\|\sum_{i=1}^n c_i^2\|_\infty$-subgaussian.

Remark 4.21. As the one-sided Lipschitz assumption implies $D_i^- f(x) \le c_i(x)$,
the upper tail bound obtained from Talagrand’s inequality is in fact slightly
worse than the upper tail bound obtained from the one-sided difference
inequality of Theorem 3.18. As was emphasized above, the key improvement
over the previous chapter is the lower tail bound. On the other hand, we will
see in the proof of Theorem 4.20 that the lower tail bound can be proved with
variance proxy $\mathbf{E}[\sum_{i=1}^n c_i^2]$, which is even better than the bound $\|\sum_{i=1}^n c_i^2\|_\infty$
in the statement above (in fact, this variance proxy coincides with
the variance bound of Corollary 2.4). Thus the statement of Theorem 4.20
can be somewhat improved both in the upper and lower tails, but the present
(already useful) statement is the most compact form of the result.

To  illustrate  Talagrand’s  inequality,  let  us  revisit  Example  3.19. 

Example 4.22 (Random matrices). We recall the setting of Examples 2.5 and
3.19. Let $M$ be an $n \times n$ symmetric matrix where $\{M_{ij} : i \ge j\}$ are i.i.d. symmetric
Bernoulli random variables $\mathbf{P}[M_{ij} = \pm 1] = \frac{1}{2}$. We denote by $\lambda_{\max}(M)$
the largest eigenvalue of $M$, and by $v_{\max}(M)$ a corresponding eigenvector.

In Example 2.5 we computed the one-sided differences $D_{ij}^- \lambda_{\max}(M)$. However,
the one-sided Lipschitz property can be verified in precisely the same
manner. In particular, repeating the computation of Example 2.5, we obtain

$$\lambda_{\max}(M) - \lambda_{\max}(M') \le 2 \sum_{i \ge j} v_{\max}(M)_i v_{\max}(M)_j (M_{ij} - M'_{ij}) \le 4 \sum_{i \ge j} |v_{\max}(M)_i| \, |v_{\max}(M)_j| \, 1_{M_{ij} \ne M'_{ij}}.$$

The function $M \mapsto \lambda_{\max}(M)$ therefore satisfies the one-sided Lipschitz property
with weights $c_{ij}(M) = 4 |v_{\max}(M)_i| \, |v_{\max}(M)_j|$. It now follows immediately
from Talagrand’s concentration inequality that the random variable
$\lambda_{\max}(M)$ is 16-subgaussian. Thus we have finally obtained a full subgaussian
counterpart of the variance bound obtained in Example 2.5.
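
The constant 16 can be checked by a one-line computation. The following is a sketch under the assumption that the eigenvector is normalized to unit Euclidean length; the shorthand $v$ for $v_{\max}(M)$ is introduced here only for readability.

```latex
% Shorthand (assumption): v = v_{\max}(M), normalized so that \sum_i v_i^2 = 1.
\sum_{i \ge j} c_{ij}(M)^2
  = 16 \sum_{i \ge j} v_i^2 v_j^2
  \le 16 \Big( \sum_{i=1}^n v_i^2 \Big)^2
  = 16 .
% Hence \| \sum_{i \ge j} c_{ij}^2 \|_\infty \le 16,
% which is the variance proxy appearing in Theorem 4.20.
```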

The one-sided Lipschitz assumption of Talagrand’s concentration inequality
corresponds to a (local) Lipschitz property with respect to a weighted
Hamming distance. When one is dealing with real-valued random variables,
it is often most convenient to consider Lipschitz properties with respect to
the usual Euclidean distance. While one can obtain such a result for specific
distributions (for example, in the Gaussian case), it is not generally true that
distributions in $\mathbb{R}^n$ satisfy a concentration property with respect to the Euclidean
distance. However, for convex functions, such a concentration property
turns out to hold for any family of independent bounded random variables,
regardless of the specific properties of their distributions. This simple observation
is a very useful consequence of Talagrand’s inequality.

Corollary 4.23. Let $X_1, \ldots, X_n$ be independent with values in $[0,1]$. Then
$f(X_1, \ldots, X_n)$ is $\| \|\nabla f\|^2 \|_\infty$-subgaussian for every convex function $f$.

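Proof. The first-order condition for convexity implies

$$f(x) - f(y) \le \nabla f(x) \cdot (x - y) \quad \text{for all } x, y.$$

As $|x_i - y_i| \le 1$ by assumption, we obtain

$$f(x) - f(y) \le \sum_{i=1}^n \Big| \frac{\partial f(x)}{\partial x_i} \Big| 1_{x_i \ne y_i}.$$

The result follows immediately from Theorem 4.20. □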
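
As an illustration of how the corollary is applied, consider the log-sum-exp function. This example does not appear in the text; it is chosen here only because it is a smooth convex function whose gradient is easy to bound, and the notation $p_i(x)$ is introduced for this sketch.

```latex
% Illustrative example (not from the text): f is the log-sum-exp function.
f(x) = \log \sum_{i=1}^n e^{x_i},
\qquad
\frac{\partial f(x)}{\partial x_i} = p_i(x) := \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}} .
% f is convex, p_i(x) \ge 0 and \sum_i p_i(x) = 1, so
\|\nabla f(x)\|^2 = \sum_{i=1}^n p_i(x)^2 \le \Big( \sum_{i=1}^n p_i(x) \Big)^2 = 1 .
% Corollary 4.23 then shows that f(X_1,\ldots,X_n) is 1-subgaussian
% for any independent X_1,\ldots,X_n with values in [0,1].
```

Note that the resulting variance proxy does not depend on the dimension $n$.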

We now turn to the proof of Theorem 4.20. We will attempt to follow
as closely as possible the transportation proof of McDiarmid’s inequality in
Example 4.17. Of course, unlike the weighted Hamming distance, the quantity
$\sum_i c_i(x) 1_{x_i \ne y_i}$ that appears in the one-sided Lipschitz property is not a metric:
it is not even symmetric in $x, y$! Remarkably, this turns out to be unimportant:
we will prove a transportation cost inequality for an asymmetric notion of
Wasserstein “distance” that captures the one-sided Lipschitz property.

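Theorem 4.24 (Marton). Define the asymmetric “distance”

$$d_2(\mu, \nu) := \inf_{M \in \mathcal{C}(\mu, \nu)} \; \sup_{\mathbf{E}_M[\sum_{i=1}^n c_i(X)^2] \le 1} \mathbf{E}_M \Big[ \sum_{i=1}^n c_i(X) 1_{X_i \ne Y_i} \Big]$$

between probability measures $\mu, \nu$ on $\mathbb{X}_1 \times \cdots \times \mathbb{X}_n$. Then

$$d_2(\nu, \mu_1 \otimes \cdots \otimes \mu_n) \le \sqrt{2 D(\nu \| \mu_1 \otimes \cdots \otimes \mu_n)},$$

$$d_2(\mu_1 \otimes \cdots \otimes \mu_n, \nu) \le \sqrt{2 D(\nu \| \mu_1 \otimes \cdots \otimes \mu_n)}$$

for any probability measures $\nu$ and $\mu_1 \otimes \cdots \otimes \mu_n$ on $\mathbb{X}_1 \times \cdots \times \mathbb{X}_n$.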

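With this asymmetric transportation cost inequality in hand, the remainder
of the proof follows exactly as in the previous sections.

Proof (Theorem 4.20). Suppose $f$ satisfies the one-sided Lipschitz property

$$f(x) - f(y) \le \sum_{i=1}^n c_i(x) 1_{x_i \ne y_i}.$$

Let $\mu := \mu_1 \otimes \cdots \otimes \mu_n$ be a product measure and let $\nu$ be any probability measure. Then

$$\mathbf{E}_\nu f - \mathbf{E}_\mu f = \inf_{M \in \mathcal{C}(\nu, \mu)} \mathbf{E}_M[f(X) - f(Y)] \le \Big\| \sum_{i=1}^n c_i^2 \Big\|_\infty^{1/2} d_2(\nu, \mu),$$

$$\mathbf{E}_\mu f - \mathbf{E}_\nu f = \inf_{M \in \mathcal{C}(\mu, \nu)} \mathbf{E}_M[f(X) - f(Y)] \le \mathbf{E}_\mu \Big[ \sum_{i=1}^n c_i^2 \Big]^{1/2} d_2(\mu, \nu).$$

We therefore have by Theorem 4.24

$$|\mathbf{E}_\nu f - \mathbf{E}_\mu f| \le \sqrt{2 \Big\| \sum_{i=1}^n c_i^2 \Big\|_\infty D(\nu \| \mu)},$$

and it follows precisely as in the proof of Theorem 4.8 that $f(X_1, \ldots, X_n)$ is
$\|\sum_{i=1}^n c_i^2\|_\infty$-subgaussian whenever $X \sim \mu_1 \otimes \cdots \otimes \mu_n$. □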

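Remark 4.25. We have used Theorem 4.8 to deduce the subgaussian property,
which by its definition controls both the upper and lower tail probabilities.
The proof of Theorem 4.8, however, implies also a one-sided result: given $f, \mu$,

$$\log \mathbf{E}_\mu[e^{\lambda \{f - \mathbf{E}_\mu f\}}] \le \frac{\lambda^2 \sigma^2}{2} \quad \text{for all } \lambda \ge 0$$

if and only if

$$\mathbf{E}_\nu f - \mathbf{E}_\mu f \le \sqrt{2 \sigma^2 D(\nu \| \mu)} \quad \text{for all } \nu.$$

As $\lambda \ge 0$ here, this characterizes the upper tail; the lower tail is obtained by
applying this result to $-f$. Now note that there is an asymmetry in the proof
of Theorem 4.20: for the upper tail, the best we can do is

$$\mathbf{E}_\nu f - \mathbf{E}_\mu f \le \sqrt{2 \Big\| \sum_{i=1}^n c_i^2 \Big\|_\infty D(\nu \| \mu)} \quad \text{for all } \nu;$$

for the lower tail, however, we have an even better bound

$$\mathbf{E}_\mu f - \mathbf{E}_\nu f \le \sqrt{2 \mathbf{E}_\mu \Big[ \sum_{i=1}^n c_i^2 \Big] D(\nu \| \mu)} \quad \text{for all } \nu.$$

Thus the proof of Theorem 4.20 already yields a sharper conclusion: for $t \ge 0$

$$\mathbf{P}[f(X) \ge \mathbf{E} f(X) + t] \le e^{-t^2 / 2 \|\sum_{i=1}^n c_i^2\|_\infty},$$

$$\mathbf{P}[f(X) \le \mathbf{E} f(X) - t] \le e^{-t^2 / 2 \mathbf{E}[\sum_{i=1}^n c_i(X)^2]}$$

when $\{X_i\}$ are independent and $f$ satisfies the one-sided Lipschitz property.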
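
The passage from a moment generating function bound to a tail bound in this remark is the usual Chernoff argument. It is sketched here for a generic variance proxy $\sigma^2$, which is an assumption of the sketch: one takes $\sigma^2 = \|\sum_i c_i^2\|_\infty$ for $f$ and $\sigma^2 = \mathbf{E}[\sum_i c_i(X)^2]$ for $-f$.

```latex
% Assumption: \log \mathbf{E}_\mu[e^{\lambda\{f - \mathbf{E}_\mu f\}}] \le \lambda^2\sigma^2/2 for all \lambda \ge 0.
\mathbf{P}[f - \mathbf{E}_\mu f \ge t]
  \le e^{-\lambda t}\, \mathbf{E}_\mu\big[e^{\lambda\{f - \mathbf{E}_\mu f\}}\big]
  \le e^{-\lambda t + \lambda^2\sigma^2/2}
  \qquad (\lambda \ge 0).
% The exponent is minimized at \lambda = t/\sigma^2 \ge 0, which gives
\mathbf{P}[f - \mathbf{E}_\mu f \ge t] \le e^{-t^2/2\sigma^2}, \qquad t \ge 0 .
```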
The rest of this section is devoted to the proof of Theorem 4.24. Following
the logic of the previous section, the proof will consist of two parts. First, we
will use a tensorization principle to reduce the problem to the one-dimensional
case. Then, we will give a direct proof of Theorem 4.24 in one dimension, that
is, we will prove an asymmetric analogue of Pinsker’s inequality.

In order to understand how to apply tensorization, let us begin by stating
a simple reformulation of the asymmetric distance $d_2$.


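Lemma 4.26. For any $\mu, \nu$ on $\mathbb{X}_1 \times \cdots \times \mathbb{X}_n$, we have

$$d_2(\mu, \nu) = \inf_{M \in \mathcal{C}(\mu, \nu)} \Big[ \sum_{i=1}^n \mathbf{E}_M \big[ M[X_i \ne Y_i | X]^2 \big] \Big]^{1/2}.$$

Proof. This follows immediately from

$$\mathbf{E}_M \Big[ \sum_{i=1}^n c_i(X) 1_{X_i \ne Y_i} \Big] = \mathbf{E}_M \Big[ \sum_{i=1}^n c_i(X) M[X_i \ne Y_i | X] \Big]$$

and Cauchy-Schwarz for the inner product $\langle c, \tilde{c} \rangle = \mathbf{E}_M[\sum_{i=1}^n c_i(X) \tilde{c}_i(X)]$. □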
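
The Cauchy-Schwarz step can be spelled out as follows. This is a sketch in which the shorthand $a_i$ and $\|a\|$ is introduced here for illustration and is not notation from the text.

```latex
% Shorthand (assumption): a_i(X) := M[X_i \ne Y_i \mid X],
% \|a\| := \mathbf{E}_M[\sum_{i=1}^n a_i(X)^2]^{1/2}.
\mathbf{E}_M\Big[\sum_{i=1}^n c_i(X)\,a_i(X)\Big]
  \le \mathbf{E}_M\Big[\sum_{i=1}^n c_i(X)^2\Big]^{1/2}
      \mathbf{E}_M\Big[\sum_{i=1}^n a_i(X)^2\Big]^{1/2}
  \le \|a\|
\quad\text{whenever } \mathbf{E}_M\Big[\sum_{i=1}^n c_i(X)^2\Big] \le 1 ,
% with equality for the choice c_i = a_i / \|a\| (when \|a\| > 0).
% The supremum over c in the definition of d_2 therefore equals \|a\|.
```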

This  simple  reformulation  of  the  definition  of  d2  is  already  very  close  to  the 
form  of  the  tensorization  principle  that  we  proved  in  Theorem  4.15.  In  fact, 
only  a  minor  modification  is  needed  in  the  proof  to  establish  the  following. 


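Proposition 4.27. Let $\mu_i$ be a probability measure on $\mathbb{X}_i$ such that

$$\inf_{M \in \mathcal{C}(\mu_i, \nu)} \mathbf{E}_M \big[ M[X \ne Y | X]^2 \big] \le 2 D(\nu \| \mu_i) \quad \text{for all } \nu$$

holds for every $i = 1, \ldots, n$. Then we have

$$\inf_{M \in \mathcal{C}(\mu_1 \otimes \cdots \otimes \mu_n, \nu)} \sum_{i=1}^n \mathbf{E}_M \big[ M[X_i \ne Y_i | X]^2 \big] \le 2 D(\nu \| \mu_1 \otimes \cdots \otimes \mu_n) \quad \text{for all } \nu.$$

The same conclusion follows if the infimum in the first inequality is replaced
by $M \in \mathcal{C}(\nu, \mu_i)$ and in the second inequality by $M \in \mathcal{C}(\nu, \mu_1 \otimes \cdots \otimes \mu_n)$.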

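Proof. We follow closely the proof of Theorem 4.15. Suppose the conclusion
has been proved for the case $n = k$; it suffices to show that it holds for the
case $n = k + 1$. To this end, define the probability measures $\nu^{(k)}$ and $\nu_{y_1, \ldots, y_k}$ as in
the proof of Theorem 4.15, and fix $\varepsilon > 0$. By the induction hypothesis, we can
find $M^{(k)} \in \mathcal{C}(\mu_1 \otimes \cdots \otimes \mu_k, \nu^{(k)})$ and $M_{y_1, \ldots, y_k} \in \mathcal{C}(\mu_{k+1}, \nu_{y_1, \ldots, y_k})$ such that

$$2 D(\nu^{(k)} \| \mu_1 \otimes \cdots \otimes \mu_k) \ge \sum_{i=1}^k \mathbf{E}_{M^{(k)}} \big[ M^{(k)}[X_i \ne Y_i | X]^2 \big] - \varepsilon,$$

$$2 D(\nu_{y_1, \ldots, y_k} \| \mu_{k+1}) \ge \mathbf{E}_{M_{y_1, \ldots, y_k}} \big[ M_{y_1, \ldots, y_k}[X \ne Y | X]^2 \big] - \varepsilon.$$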
Define $M \in \mathcal{C}(\mu_1 \otimes \cdots \otimes \mu_{k+1}, \nu)$ as in the proof of Theorem 4.15. Then we
obtain using the chain rule of relative entropy and the definition of $M$

$$2 D(\nu \| \mu_1 \otimes \cdots \otimes \mu_{k+1}) \ge \sum_{i=1}^k \mathbf{E}_M \big[ M[X_i \ne Y_i | X_1, \ldots, X_k]^2 \big] - 2\varepsilon + \mathbf{E}_M \big[ M[X_{k+1} \ne Y_{k+1} | Y_1, \ldots, Y_k, X]^2 \big].$$

Now note that as $M_{Y_1, \ldots, Y_k}[X_{k+1} \in \cdot\,] = \mu_{k+1}$, evidently $X_{k+1}$ is independent
of $\{X_i, Y_i : i \le k\}$. Thus $M[X_i \ne Y_i | X_1, \ldots, X_k] = M[X_i \ne Y_i | X]$ for $i \le k$, so

$$2 D(\nu \| \mu_1 \otimes \cdots \otimes \mu_{k+1}) \ge \sum_{i=1}^{k+1} \mathbf{E}_M \big[ M[X_i \ne Y_i | X]^2 \big] - 2\varepsilon$$

using Jensen. Taking the infimum over $M$ and letting $\varepsilon \downarrow 0$ yields the claim.

The case where $\nu$ and $\mu$ are reversed corresponds to reversing the roles of
$X$ and $Y$ in the above proof. Thus the only change in the proof is that we
must now show $M[X_i \ne Y_i | Y_1, \ldots, Y_k] = M[X_i \ne Y_i | Y]$. This follows as $Y_{k+1}$
is conditionally independent of $X_i$ given $Y_1, \ldots, Y_k$ by the definition of $M$. □

By  virtue  of  Proposition  4.27,  it  remains  only  to  prove  Theorem  4.24  in  the 
case  n  =  1.  To  this  end,  we  will  first  prove  an  analogue  of  Monge-Kantorovich 
duality  in  this  setting  by  adapting  the  computations  in  Example  4.14. 

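Lemma 4.28. Suppose that $\mu \sim \nu$ are probability measures on $\mathbb{X}$. Then

$$\inf_{M \in \mathcal{C}(\mu, \nu)} \mathbf{E}_M \big[ M[X \ne Y | X]^2 \big]^{1/2} = \sup_{f \ge 0, \; \mu(f^2) \le 1} \{ \mathbf{E}_\mu f - \mathbf{E}_\nu f \} = \Big[ \int \Big( 1 - \frac{d\nu}{d\mu} \Big)_+^2 d\mu \Big]^{1/2}.$$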

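Proof. It is easily seen by Cauchy-Schwarz that

$$\sup \{ \mathbf{E}_\mu f - \mathbf{E}_\nu f \} = \sup \int \Big( 1 - \frac{d\nu}{d\mu} \Big) f \, d\mu = \Big[ \int \Big( 1 - \frac{d\nu}{d\mu} \Big)_+^2 d\mu \Big]^{1/2},$$

where the supremum is taken over $f \ge 0$, $\mu(f^2) \le 1$. Moreover,

$$\sup \{ \mathbf{E}_\mu f - \mathbf{E}_\nu f \} = \inf_{M \in \mathcal{C}(\mu, \nu)} \sup \mathbf{E}_M[f(X) - f(Y)] \le \inf_{M \in \mathcal{C}(\mu, \nu)} \sup \mathbf{E}_M[f(X) 1_{X \ne Y}] = \inf_{M \in \mathcal{C}(\mu, \nu)} \mathbf{E}_M \big[ M[X \ne Y | X]^2 \big]^{1/2}.$$

It remains to prove that the inequality is attained. To this end, construct
precisely the same coupling $M \in \mathcal{C}(\mu, \nu)$ as in Example 4.14. Then

$$M[X \ne Y | X] = \Big( 1 - \frac{d\nu}{d\mu}(X) \Big)_+,$$

and it follows immediately that $\mathbf{E}_M[M[X \ne Y | X]^2] = \int (1 - \frac{d\nu}{d\mu})_+^2 \, d\mu$. □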
We can now complete the proof of Theorem 4.24.

Proof (Theorem 4.24). By Proposition 4.27, it suffices to consider the case
$n = 1$. That is, we must prove for any probability measures $\mu, \nu$ on $\mathbb{X}$

$$d_2(\nu, \mu) \le \sqrt{2 D(\nu \| \mu)}, \qquad d_2(\mu, \nu) \le \sqrt{2 D(\nu \| \mu)}$$

(this is, in essence, an asymmetric analogue of Pinsker’s inequality). It suffices
to assume $\nu \ll \mu$, as otherwise $D(\nu \| \mu) = \infty$ and the result is trivial. By
a simple perturbation argument, we can assume that $\mu \sim \nu$ (replace $\nu$ by
$d\nu_\varepsilon = (1 + \varepsilon)^{-1} (\frac{d\nu}{d\mu} + \varepsilon) \, d\mu$ and let $\varepsilon \downarrow 0$ at the end of the proof).

The proof is ultimately a calculus exercise. It is not difficult to show that

$$x \log x - x + 1 - \frac{(1 - x)^2}{2} \ge 0, \qquad -\log x - 1 + x - \frac{(1 - x)^2}{2} \ge 0$$

for $0 < x \le 1$ (note that the inequalities hold for $x = 1$, and the left-hand
sides in these inequalities are decreasing functions for $0 < x \le 1$). Thus

$$x \log x - x + 1 = (x \log x - x + 1) 1_{x \le 1} + x (-\log x^{-1} - 1 + x^{-1}) 1_{x > 1} \ge \frac{(1 - x)_+^2 + x (1 - x^{-1})_+^2}{2}$$

for all $x \ge 0$. We can therefore estimate

$$d_2(\mu, \nu)^2 + d_2(\nu, \mu)^2 = \int \Big( 1 - \frac{d\nu}{d\mu} \Big)_+^2 d\mu + \int \Big( 1 - \frac{d\mu}{d\nu} \Big)_+^2 d\nu \le 2 \int \Big( \frac{d\nu}{d\mu} \log \frac{d\nu}{d\mu} - \frac{d\nu}{d\mu} + 1 \Big) d\mu = 2 D(\nu \| \mu).$$

This evidently implies the claim. □
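
The two elementary inequalities used in this proof can be verified by differentiating. The following sketch assumes the inequalities are read with the quadratic term $(1-x)^2/2$; the names $h$ and $g$ are introduced here for illustration only.

```latex
% Names h and g are hypothetical labels for the two left-hand sides.
h(x) := x\log x - x + 1 - \tfrac{1}{2}(1-x)^2, \qquad
h'(x) = \log x + 1 - x \le 0 ,
\\
g(x) := -\log x - 1 + x - \tfrac{1}{2}(1-x)^2, \qquad
g'(x) = -\frac{1}{x} + 2 - x = -\frac{(1-x)^2}{x} \le 0 ,
% for 0 < x \le 1, where h'(x) \le 0 uses \log x \le x - 1.
% Since h(1) = g(1) = 0 and both functions are decreasing on (0,1],
% we get h(x) \ge 0 and g(x) \ge 0 for 0 < x \le 1.
```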


Problems 

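4.6 (Rademacher processes). Let $\varepsilon_1, \ldots, \varepsilon_n$ be independent symmetric
Bernoulli random variables $\mathbf{P}[\varepsilon_i = \pm 1] = \frac{1}{2}$, and let $T \subseteq \mathbb{R}^n$. Define

$$Z = \sup_{t \in T} \sum_{k=1}^n \varepsilon_k t_k, \qquad \sigma^2 = 4 \sup_{t \in T} \sum_{k=1}^n t_k^2.$$

Show that $Z$ is $\sigma^2$-subgaussian (cf. Problems 2.2, 3.7, and 3.14).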

4.7  (Balls  and  bins).  Suppose  that  m  balls  are  thrown  independently  and 
uniformly  at  random  into  n  bins.  Let  Z  be  the  number  of  empty  bins.  What 
can  we  say  about  the  magnitude  and  fluctuations  of  the  random  variable  Z? 

a. Show that $\mathbf{E}[Z] = n (1 - 1/n)^m$.

b. Use McDiarmid’s inequality to show that $Z$ is $m/4$-subgaussian.

The bound on the fluctuations obtained by McDiarmid’s inequality is counterintuitive:
$\mathbf{E}[Z]$ decreases with $m$ but the variance proxy in McDiarmid’s
inequality increases with $m$! Using Talagrand’s concentration inequality, we
can obtain a better bound on the fluctuations of $Z$.

c. Use Talagrand’s inequality to show that $Z$ is $n \wedge m$-subgaussian.

Hint: let $f_m(b_1, \ldots, b_m)$ be the number of nonempty bins if we put ball
$i$ in bin $b_i$, and note that $f_m(b_1, \ldots, b_m) = \sum_{i=1}^m 1_{b_i \ne b_j \text{ for } j < i}$. Show that
$f_m(b) \le f_{2m}(b'_1, b_1, \ldots, b'_m, b_m) \le f_m(b') + \sum_{i=1}^m 1_{b_i \ne b'_i} 1_{b_i \ne b_j \text{ for } j < i}$.

4.8 (Travelling salesman problem). Let $X_1, \ldots, X_n$ be i.i.d. points that
are uniformly distributed in the unit square $[0,1]^2$. We think of $X_i$ as the
location of city $i$. The goal of the travelling salesman problem is to find a tour
through all $n$ cities with the shortest possible length. We denote by

$$L_n := \min_\sigma \{ \|X_{\sigma(1)} - X_{\sigma(2)}\| + \|X_{\sigma(2)} - X_{\sigma(3)}\| + \cdots + \|X_{\sigma(n)} - X_{\sigma(1)}\| \}$$

the length of the shortest tour, where the minimum is taken over all permutations
$\sigma$ of $\{1, \ldots, n\}$. Let us begin by investigating the magnitude of $L_n$.

a. Show that $\mathbf{E}[L_n] \asymp \sqrt{n}$.

Hint: argue that $L_n \ge \sum_{k=1}^n \min_{l \ne k} \|X_k - X_l\|$ for the lower bound and
$L_n \le L_{n-1} + 2 \min_{k < n} \|X_n - X_k\|$ for the upper bound.

b.  Use  McDiarmid’s  inequality  to  show  that  Ln  is  2n-subgaussian. 

The  bound  using  McDiarmid’s  inequality  is  terrible:  it  yields  an  upper  bound 
on  the  magnitude  of  the  fluctuations  that  is  of  the  same  order  as  the  mean. 
Thus  McDiarmid’s  inequality  does  not  even  show  that  Ln  concentrates  around 
its  mean.  Using  Talagrand’s  inequality,  we  will  be  able  to  obtain  a  much 
sharper  concentration  result.  This  requires  some  geometric  insight. 

c. Let $v = (0, a)$ and $w = (b, 0)$ be corners of a right-angled triangle $T =
\mathrm{conv}\{0, v, w\}$. Show that $\|v - x\|^2 + \|x - w\|^2 \le \|v - w\|^2$ for any $x \in T$.

d. Prove the following: for any $x_1, \ldots, x_n \in T$, there is a permutation $\sigma$ such
that $\|v - x_{\sigma(1)}\|^2 + \sum_{i=1}^{n-1} \|x_{\sigma(i)} - x_{\sigma(i+1)}\|^2 + \|x_{\sigma(n)} - w\|^2 \le \|v - w\|^2$.
Hint: argue by induction. Suppose the result is true for all right-angled
triangles $S$ and $x_1, \ldots, x_{n-1} \in S$. Divide $T$ into two right-angled triangles
by drawing a line from the origin to the hypotenuse. If both triangles
contain points, then use the induction hypothesis to conclude. Otherwise,
continue subdividing until the induction hypothesis applies.

e. Conclude that for any points $x_1, \ldots, x_n \in [0,1]^2$, there exists a permutation
$\sigma$ such that $\|x_{\sigma(1)} - x_{\sigma(2)}\|^2 + \|x_{\sigma(2)} - x_{\sigma(3)}\|^2 + \cdots + \|x_{\sigma(n)} - x_{\sigma(1)}\|^2 \le 4$.

We are now going to use this geometric insight to analyze the length of travelling
salesman tours. Recall that a tour through a set of points $x_1, \ldots, x_n$ is
defined by a permutation $\sigma$ of $\{1, \ldots, n\}$. The length of a given tour will be
denoted as $l_n(x, \sigma)$, so we have $L_n := \min_\sigma l_n(X, \sigma)$.

f. Let $x = \{x_1, \ldots, x_n\}$ and $y = \{y_1, \ldots, y_n\}$ be sets of points with $x \cap y \ne \emptyset$.
Let $\sigma$ be a tour of $x$ and $\tau$ be a tour of $y$. Show that there exists a tour
$\rho$ of $x \cup y$ such that $l_{2n}(x \cup y, \rho) \le l_n(y, \tau) + 2 \sum_{i=1}^n 1_{x_i \notin y} d_i(x, \sigma)$, where
$d_i(x, \sigma)$ is the distance between $x_i$ and the previous point in the tour $\sigma$.
Hint: imagine $\sigma$ and $\tau$ are two partially overlapping hiking trails marked
red and blue. Your aim is to systematically explore the union of the trails.
To this end, perform the following walk: start walking the blue trail; if at
any point the red trail diverges from the blue trail, walk down the red trail
until just before it hits the blue trail again, then walk back to where you
diverged from the blue trail and continue down the blue trail. While this
walk is not a tour (as some points are visited twice), you can “straighten it
out” into a genuine tour without increasing its length.

g. Fix for every $x_1, \ldots, x_n \in [0,1]^2$ a tour $\sigma_x$ as in part e. above. Show that
$\min_\sigma l_n(x, \sigma) \le \min_\sigma l_n(y, \sigma) + \sum_{i=1}^n 2 d_i(x, \sigma_x) 1_{x_i \ne y_i}$ for all $x, y \in [0,1]^{2n}$.

h. Conclude that $L_n$ is 16-subgaussian for every $n \ge 1$.

4.9 (Convexity and Euclidean concentration). Corollary 4.23 shows that
convex Lipschitz functions of bounded independent variables concentrate in
the same manner as Lipschitz functions of Gaussian random variables. However,
in the Gaussian case, convexity is not needed. The goal of this problem
is to show that convexity is in fact essential in the setting of Corollary 4.23.

Let $\{X_k : k \ge 1\}$ be i.i.d. symmetric Bernoulli variables $\mathbf{P}[X_k = \pm 1] = \frac{1}{2}$.
Consider for each $n \ge 1$ the function $f_n(x) = d(x, A_n)$ on $\mathbb{R}^n$, where

$$A_n = \Big\{ y \in \{-1, 1\}^n : \sum_{i=1}^n y_i \le 0 \Big\}$$

and $d(x, A) := \inf_{y \in A} \|x - y\|$. Note that the function $f_n(x)$ is not convex.

a. Show that $f_n$ is 1-Lipschitz with respect to the Euclidean distance on $\mathbb{R}^n$.

b. Show that $\mathrm{med}[f_n(X_1, \ldots, X_n)] = 0$.

c. Show that if $x \in \{-1, 1\}^n$ satisfies $\sum_{i=1}^n x_i \ge \sqrt{n}$, then

$$\sqrt{n} \le \sum_{i=1}^n (x_i - y_i) \le \sum_{i=1}^n |x_i - y_i|^2 \quad \text{for } y \in A_n.$$

In particular, this implies $f_n(x) \ge n^{1/4}$.

d. Show that

$$\liminf_{n \to \infty} \mathbf{P}[f_n(X_1, \ldots, X_n) \ge n^{1/4}] > 0.$$

Argue that this implies that $f_n(X_1, \ldots, X_n)$ cannot be subgaussian with
variance proxy independent of the dimension $n$.

e. Show that if $g$ is convex and 1-Lipschitz with respect to the Euclidean
distance on $\mathbb{R}^n$, then $g(X_1, \ldots, X_n)$ is 4-subgaussian (independent of dimension
$n$). In view of the above, convexity is evidently essential.


4.4  Dimension-free  concentration  and  the  T2-inequality 


In the previous sections we have obtained a complete characterization of the
concentration of Lipschitz functions on a fixed metric space in terms of transportation
cost inequalities (Theorem 4.8), and we have developed a tensorization
principle for such inequalities (Theorem 4.15). Together, these two principles
allow us to deduce concentration of independent random variables in the
following manner. Suppose that $X_i \sim \mu_i$ on $(\mathbb{X}_i, d_i)$ are such that

$$f(X_i) \text{ is 1-subgaussian when } |f(x) - f(y)| \le d_i(x, y),$$

and that $X_1, \ldots, X_n$ are independent. Then we have for any $\sum_{i=1}^n c_i^2 = 1$

$$f(X_1, \ldots, X_n) \text{ is 1-subgaussian when } |f(x) - f(y)| \le \sum_{i=1}^n c_i d_i(x_i, y_i).$$

This suffices to recover, for example, McDiarmid’s inequality.

However, in the previous chapters, we have seen examples that exhibit
substantially better concentration properties than is suggested by this general
principle. For example, let $X_i \sim N(0, 1)$ on $\mathbb{X}_i = \mathbb{R}$. Then the Gaussian
concentration property states not only that each $X_i$ exhibits the Lipschitz
concentration property with respect to the metric $d_i(x, y) = |x - y|$, but also

$$f(X_1, \ldots, X_n) \text{ is 1-subgaussian when } |f(x) - f(y)| \le \Big[ \sum_{i=1}^n d_i(x_i, y_i)^2 \Big]^{1/2}.$$

Thus we even have dimension-free concentration for independent Gaussian
variables with respect to the Euclidean distance $d(x, y) = [\sum_i d_i(x_i, y_i)^2]^{1/2}$
rather than just the weighted $\ell_1$-distance $d_c(x, y) = \sum_i c_i d_i(x_i, y_i)$. This is a
much stronger conclusion: indeed, any 1-Lipschitz function with respect to $d_c$
is 1-Lipschitz with respect to $d$, but a function that is 1-Lipschitz with respect
to $d$ may not be better than $\sqrt{n}$-Lipschitz with respect to $d_c$.
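
Both halves of this comparison can be made explicit. The following is a sketch; the coordinate function used as an example and the index $j$ are illustrative choices that do not appear in the text.

```latex
% Assumption: weights with \sum_i c_i^2 = 1, as in the text. By Cauchy-Schwarz,
d_c(x,y) = \sum_{i=1}^n c_i\, d_i(x_i,y_i)
  \le \Big[\sum_{i=1}^n c_i^2\Big]^{1/2} \Big[\sum_{i=1}^n d_i(x_i,y_i)^2\Big]^{1/2}
  = d(x,y) ,
% so every function that is 1-Lipschitz for d_c is 1-Lipschitz for d.
% Illustrative example: X_i = \mathbb{R}, d_i(x,y) = |x-y|, and j an index with
% 0 < c_j \le n^{-1/2} (some index satisfies c_j \le n^{-1/2} because \sum_i c_i^2 = 1).
% For the coordinate function f(x) = x_j,
|f(x) - f(y)| = |x_j - y_j| \le d(x,y),
\qquad
|f(x) - f(y)| = c_j^{-1}\, d_c(x,y) \ \text{ when } x, y \text{ differ only in coordinate } j ,
% so f is 1-Lipschitz for d, but its Lipschitz constant for d_c is c_j^{-1} \ge \sqrt{n}.
```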

At first sight, the fact that we do not capture concentration with respect
to the Euclidean distance might appear to be an inefficiency in our approach.
One might hope that the conclusion of Theorem 4.15 can be improved to yield
a statement of the following form: if

$$W_1(\mu_i, \nu) \le \sqrt{2 \sigma^2 D(\nu \| \mu_i)} \quad \text{for all } \nu$$

holds for each $\mu_i$ on $(\mathbb{X}_i, d_i)$, then for any $n \ge 1$

$$W_1(\mu_1 \otimes \cdots \otimes \mu_n, \nu) \le \sqrt{2 \sigma^2 D(\nu \| \mu_1 \otimes \cdots \otimes \mu_n)} \quad \text{for all } \nu$$

holds for $\mu_1 \otimes \cdots \otimes \mu_n$ on $(\mathbb{X}_1 \times \cdots \times \mathbb{X}_n, [\sum_{i=1}^n d_i^2]^{1/2})$. However, this conclusion
is false: in general, it is not true that a distribution that exhibits the
Lipschitz concentration property in one dimension will exhibit dimension-free
concentration with respect to the Euclidean distance. For example, we have
seen in Problem 4.9 that this conclusion fails already for symmetric Bernoulli
variables. Thus dimension-free Euclidean concentration is a strictly stronger
property than is guaranteed by Theorem 4.8. In this section, we will show that
the latter property can nonetheless be characterized completely by means of
a stronger form of the transportation cost inequality.

In order to develop improved concentration results, we must first identify
where the inefficiency of our previous tensorization argument lies. Recall that

$$W_1(\mu_i, \nu) \le \sqrt{2 \sigma^2 D(\nu \| \mu_i)} \quad \text{for all } \nu, i$$

implies, using Theorem 4.15 with $\varphi(x) = x^2$ and $w_i(x, y) = d_i(x, y)$, that

$$\inf_{M \in \mathcal{C}(\mu_1 \otimes \cdots \otimes \mu_n, \nu)} \Big[ \sum_{i=1}^n \mathbf{E}_M[d_i(X_i, Y_i)]^2 \Big]^{1/2} \le \sqrt{2 \sigma^2 D(\nu \| \mu_1 \otimes \cdots \otimes \mu_n)}.$$

The problem with this expression is that the left-hand side is not a Wasserstein
distance. We resolved this problem in Corollary 4.16 by applying the Cauchy-Schwarz
inequality. Such a brute-force solution can only yield a transportation
cost inequality in terms of weighted $\ell_1$-distance, however. On the other hand,
note that the quantity on the left-hand side is already tantalizingly close to
a Euclidean transportation cost inequality: if only $\mathbf{E}_M[d_i(X_i, Y_i)]^2$ could be
replaced by $\mathbf{E}_M[d_i(X_i, Y_i)^2]$, we would immediately deduce

$$W_1(\mu_1 \otimes \cdots \otimes \mu_n, \nu) \le \sqrt{2 \sigma^2 D(\nu \| \mu_1 \otimes \cdots \otimes \mu_n)} \quad \text{for all } \nu$$

on $(\mathbb{X}_1 \times \cdots \times \mathbb{X}_n, [\sum_{i=1}^n d_i^2]^{1/2})$ by Jensen’s inequality. Given the technology
that we have already developed, we can easily engineer this situation by starting
from a slightly stronger inequality in one dimension.

Definition 4.29 (Quadratic Wasserstein metric). The quadratic Wasserstein
metric for probability measures $\mu, \nu$ on a metric space $(\mathbb{X}, d)$ is

$$W_2(\mu, \nu) := \inf_{M \in \mathcal{C}(\mu, \nu)} \sqrt{\mathbf{E}_M[d(X, Y)^2]}.$$
Corollary 4.30 ($T_2$-inequality). Suppose that the probability measures $\mu_i$
on $(\mathbb{X}_i, d_i)$ satisfy the quadratic transportation cost ($T_2$) inequality

$$W_2(\mu_i, \nu) \le \sqrt{2 \sigma^2 D(\nu \| \mu_i)} \quad \text{for all } \nu.$$

Then we have

$$W_2(\mu_1 \otimes \cdots \otimes \mu_n, \nu) \le \sqrt{2 \sigma^2 D(\nu \| \mu_1 \otimes \cdots \otimes \mu_n)} \quad \text{for all } \nu$$

on $(\mathbb{X}_1 \times \cdots \times \mathbb{X}_n, [\sum_{i=1}^n d_i^2]^{1/2})$.

Proof. Apply Theorem 4.15 with $\varphi(x) = x$ and $w_i(x, y) = d_i(x, y)^2$. □

By Jensen’s inequality, we evidently have

$$W_1(\mu, \nu) \le \inf_{M \in \mathcal{C}(\mu, \nu)} \mathbf{E}_M[d(X, Y)] \le \inf_{M \in \mathcal{C}(\mu, \nu)} \sqrt{\mathbf{E}_M[d(X, Y)^2]} = W_2(\mu, \nu).$$

The $T_2$-inequality is therefore a stronger assumption than the transportation
cost inequalities (or $T_1$-inequalities) that we have considered so far. On the
other hand, combining Corollary 4.30 and Theorem 4.8 shows that if each
measure $\mu_i$ satisfies a $T_2$-inequality, then the product measure $\mu_1 \otimes \cdots \otimes \mu_n$
satisfies the Lipschitz concentration property with respect to the Euclidean
distance $d = [\sum_i d_i^2]^{1/2}$, which is a much stronger conclusion than could be
deduced from the $T_1$-inequality. We have therefore obtained a sufficient condition
for dimension-free Euclidean concentration.

We could verify at this point that the Gaussian distribution satisfies the
$T_2$-inequality, so that the improved tensorization principle of Corollary 4.30
is sufficiently strong to capture Gaussian concentration (see Problems 4.10
and 4.11). This explains why the Gaussian distribution exhibits better concentration
properties than were predicted by Corollary 4.16. Instead, we will
presently prove a remarkable general fact: the $T_2$-inequality is not only sufficient,
but also necessary for dimension-free Euclidean concentration to hold!

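Theorem 4.31 (Gozlan). Let $\mu$ be a probability measure on a Polish space
$(\mathbb{X}, d)$, and let $\{X_i\}$ be i.i.d. $\sim \mu$. Denote by $d_n(x, y) := [\sum_{i=1}^n d(x_i, y_i)^2]^{1/2}$
the Euclidean metric on $\mathbb{X}^n$. Then the following are equivalent:

1. $\mu$ satisfies the $T_2$-inequality on $(\mathbb{X}, d)$:

$$W_2(\mu, \nu) \le \sqrt{2 \sigma^2 D(\nu \| \mu)} \quad \text{for all } \nu.$$

2. $\mu^{\otimes n}$ satisfies the $T_1$-inequality on $(\mathbb{X}^n, d_n)$ for every $n \ge 1$:

$$W_1(\mu^{\otimes n}, \nu) \le \sqrt{2 \sigma^2 D(\nu \| \mu^{\otimes n})} \quad \text{for all } \nu, \; n \ge 1.$$

3. There is a constant $C$ such that

$$\mathbf{P}[f(X_1, \ldots, X_n) - \mathbf{E} f(X_1, \ldots, X_n) \ge t] \le C e^{-t^2 / 2 \sigma^2}$$

for every $n \ge 1$, $t \ge 0$ and 1-Lipschitz function $f$ on $(\mathbb{X}^n, d_n)$.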
Let us emphasize that this striking result is quite unexpected. While Theorem
4.8 shows that Lipschitz concentration on a fixed metric space is characterized
by the $T_1$-inequality, the necessity in Theorem 4.8 has little bearing
on the behavior of the quadratic Wasserstein metric. The necessity of the
$T_2$-inequality in Theorem 4.31 has a different origin: it is a consequence of a
classical large deviation result in probability theory.

Theorem 4.32 (Sanov). Let $\mu$ be a probability measure on a Polish space
$\mathbb{X}$, and let $\{X_i\}$ be i.i.d. $\sim \mu$. Let $O$ be a set of probability measures on $\mathbb{X}$ that
is open for the weak convergence topology. Then

$$\liminf_{n \to \infty} \frac{1}{n} \log \mathbf{P} \Big[ \frac{1}{n} \sum_{k=1}^n \delta_{X_k} \in O \Big] \ge -\inf_{\nu \in O} D(\nu \| \mu).$$

Remark 4.33. We have only stated half of Sanov’s theorem: a matching upper
bound can be proved also (see Problem 4.12 below). However, only the lower
bound will be needed in the proof of Theorem 4.31.


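Proof. Fix $\nu \in O$ such that $D(\nu \| \mu) < \infty$. Let $f = d\nu / d\mu$, and let $\mathbf{Q}$ be the
probability under which $\{X_k\}$ are i.i.d. $\sim \nu$. As $f > 0$ $\nu$-a.s., we can estimate

$$\mathbf{P} \Big[ \frac{1}{n} \sum_{k=1}^n \delta_{X_k} \in O \Big] \ge \mathbf{P} \Big[ \frac{1}{n} \sum_{k=1}^n \delta_{X_k} \in O, \; \prod_{k=1}^n f(X_k) > 0 \Big]$$

$$= \mathbf{E}_{\mathbf{Q}} \Big[ 1_{\frac{1}{n} \sum_{k=1}^n \delta_{X_k} \in O} \prod_{k=1}^n \frac{1}{f(X_k)} \Big]$$

$$\ge e^{-n \{ \int \log f \, d\nu + \varepsilon \}} \, \mathbf{Q} \Big[ \frac{1}{n} \sum_{k=1}^n \delta_{X_k} \in O, \; \frac{1}{n} \sum_{k=1}^n \log f(X_k) \le \int \log f \, d\nu + \varepsilon \Big].$$

Note that $\int \log f \, d\nu = D(\nu \| \mu)$, while we have by the law of large numbers
$\frac{1}{n} \sum_{k=1}^n \log f(X_k) \to \int \log f \, d\nu$ and $\frac{1}{n} \sum_{k=1}^n \delta_{X_k} \to \nu$ weakly $\mathbf{Q}$-a.s. Thus the
probability in the last line converges to one, and it follows readily that

$$\liminf_{n \to \infty} \frac{1}{n} \log \mathbf{P} \Big[ \frac{1}{n} \sum_{k=1}^n \delta_{X_k} \in O \Big] \ge -D(\nu \| \mu) - \varepsilon.$$

It remains to let $\varepsilon \downarrow 0$ and take the supremum over all $\nu \in O$. □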


We  are  now  ready  to  prove  Theorem  4.31.  The  proof  of  a  few  technical 
results  that  will  be  needed  along  the  way  is  deferred  to  the  end  of  this  section. 

Proof (Theorem 4.31). We already proved $1 \Rightarrow 2$ in Corollary 4.30, while
the implication $2 \Rightarrow 3$ with $C = 1$ follows from Theorem 4.8 and the usual
Chernoff bound. It therefore remains to prove $3 \Rightarrow 1$.

We  will  need  the  following  three  facts  that  will  be  proved  below. 
1. Wasserstein law of large numbers: $\mathbf{E}[W_2(\frac{1}{n} \sum_{k=1}^n \delta_{X_k}, \mu)] \to 0$ as $n \to \infty$.

2. Lower-semicontinuity: $O_t := \{\nu : W_2(\mu, \nu) > t\}$ is an open set.

3. Smoothness: $g_n : (x_1, \ldots, x_n) \mapsto W_2(\frac{1}{n} \sum_{k=1}^n \delta_{x_k}, \mu)$ is $n^{-1/2}$-Lipschitz.

The first two claims are essentially technical exercises: $\frac{1}{n} \sum_{k=1}^n \delta_{X_k}$ converges
weakly to $\mu$ by the law of large numbers, so the only difficulty is to verify that
the convergence holds in the slightly stronger sense of the quadratic Wasserstein
distance; and lower-semicontinuity of $W_2$ is an elementary technical fact.
The third claim is a matter of direct computation, which we will do below.
Let us presently take these claims for granted and complete the proof.

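As $O_t$ is open, we can apply Sanov’s theorem to conclude that

$$-\inf_{\nu \in O_t} D(\nu \| \mu) \le \liminf_{n \to \infty} \frac{1}{n} \log \mathbf{P}[g_n(X_1, \ldots, X_n) > t].$$

As the function $g_n$ is $n^{-1/2}$-Lipschitz, however, we have

$$\mathbf{P}[g_n(X_1, \ldots, X_n) > t] \le C e^{-n (t - \mathbf{E}[g_n(X_1, \ldots, X_n)])^2 / 2 \sigma^2}$$

by the dimension-free concentration assumption. This implies

$$-\inf_{\nu \in O_t} D(\nu \| \mu) \le -\limsup_{n \to \infty} \frac{(t - \mathbf{E}[g_n(X_1, \ldots, X_n)])^2}{2 \sigma^2} = -\frac{t^2}{2 \sigma^2}$$

using the Wasserstein law of large numbers. Thus we have proved

$$\sqrt{2 \sigma^2 D(\nu \| \mu)} \ge t \quad \text{whenever } W_2(\mu, \nu) > t.$$

The $T_2$-inequality follows by choosing $t = W_2(\mu, \nu) - \varepsilon$ and letting $\varepsilon \downarrow 0$. □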

It  remains  to  establish  the  three  claims  used  in  the  proof.  We  begin  with  the 
Lipschitz  property  of  gn ,  which  follows  essentially  from  the  triangle  inequality. 

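Lemma 4.34. $g_n : x \mapsto W_2(\frac{1}{n} \sum_{k=1}^n \delta_{x_k}, \mu)$ is $n^{-1/2}$-Lipschitz on $(\mathbb{X}^n, d_n)$.

Proof. Let $M \in \mathcal{C}(\frac{1}{n} \sum_{i=1}^n \delta_{x_i}, \mu)$. If we define $\mu_i = M[Y \in \cdot \, | X = x_i]$, then

$$\mathbf{E}_M[f(X, Y)] = \frac{1}{n} \sum_{i=1}^n \int f(x_i, y) \, \mu_i(dy), \qquad \frac{1}{n} \sum_{i=1}^n \mu_i = \mu.$$

Conversely, every family of measures $\mu_1, \ldots, \mu_n$ with $\frac{1}{n} \sum_{i=1}^n \mu_i = \mu$ defines a
coupling $M \in \mathcal{C}(\frac{1}{n} \sum_{i=1}^n \delta_{x_i}, \mu)$ in this manner. We can therefore estimate

$$g_n(x) - g_n(\tilde{x}) \le \sup_{\frac{1}{n} \sum_{i=1}^n \mu_i = \mu} \Big\{ \Big[ \frac{1}{n} \sum_{i=1}^n \int d(x_i, y)^2 \mu_i(dy) \Big]^{1/2} - \Big[ \frac{1}{n} \sum_{i=1}^n \int d(\tilde{x}_i, y)^2 \mu_i(dy) \Big]^{1/2} \Big\}$$

$$\le \sup_{\frac{1}{n} \sum_{i=1}^n \mu_i = \mu} \Big[ \frac{1}{n} \sum_{i=1}^n \int \{ d(x_i, y) - d(\tilde{x}_i, y) \}^2 \mu_i(dy) \Big]^{1/2}$$

$$\le \frac{1}{\sqrt{n}} \Big[ \sum_{i=1}^n d(x_i, \tilde{x}_i)^2 \Big]^{1/2},$$

where in the last two lines we used, respectively, the reverse triangle inequality
for $L^2$ norms (that is, $\|X\|_2 - \|Y\|_2 \le \|X - Y\|_2$) and for the metric $d$. □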


Next, we establish lower-semicontinuity of $W_2$. The proof of this technical
lemma is little more than an exercise in using weak convergence.

Lemma 4.35. $\nu \mapsto W_2(\nu,\mu)$ is lower-semicontinuous in the weak topology.

Proof. Let $\nu_n \to \nu$ weakly as $n \to \infty$. We must show that

$$\liminf_{n\to\infty} W_2(\nu_n,\mu) \ge W_2(\nu,\mu).$$

Fix $\varepsilon > 0$, and choose for each $n$ a coupling $M_n \in \mathcal{C}(\nu_n,\mu)$ such that

$$W_2(\nu_n,\mu) \ge \sqrt{\mathbf{E}_{M_n}[d(X,Y)^2]} - \varepsilon.$$

We claim that the sequence $\{M_n\}$ is tight. Indeed, the sequence $\{\nu_n\}$ is tight
(as it converges) and clearly $\mu$ is itself tight. For any $\delta > 0$, choose a compact
set $K_\delta$ such that $\nu_n(K_\delta) \ge 1 - \delta/2$ for all $n \ge 1$ and $\mu(K_\delta) \ge 1 - \delta/2$. Then
evidently $M_n(K_\delta \times K_\delta) \ge 1 - \delta$, and thus tightness follows.

Using tightness, we can choose a subsequence $n_k \uparrow \infty$ such that $M_{n_k} \to M$
weakly for some $M \in \mathcal{C}(\nu,\mu)$ and $\liminf_n W_2(\nu_n,\mu) = \lim_k W_2(\nu_{n_k},\mu)$. As the
metric $d$ is continuous and nonnegative, we obtain

$$\liminf_{n\to\infty} W_2(\nu_n,\mu) \ge \liminf_{k\to\infty} \sqrt{\mathbf{E}_{M_{n_k}}[d(X,Y)^2]} - \varepsilon \ge \sqrt{\mathbf{E}_M[d(X,Y)^2]} - \varepsilon.$$

Thus $\liminf_n W_2(\nu_n,\mu) \ge W_2(\nu,\mu) - \varepsilon$, and we conclude by letting $\varepsilon \downarrow 0$. □

Finally, we prove the Wasserstein law of large numbers. As the classical
law of large numbers already implies that $\frac{1}{n}\sum_{k=1}^n \delta_{X_k} \to \mu$ weakly, this is
almost obvious. The only issue that arises here is that convergence in $W_2$ is
stronger than weak convergence, as it implies convergence of expectations of
unbounded functions with up to quadratic growth. Proving that this is indeed
the case under the assumption of Theorem 4.31 is an exercise in truncation.

Lemma 4.36. Suppose that $\mu$ satisfies condition 3 of Theorem 4.31. Then we
have $\mathbf{E}\big[W_2\big(\frac{1}{n}\sum_{k=1}^n \delta_{X_k}, \mu\big)\big] \to 0$ as $n \to \infty$ when $\{X_k\}$ are i.i.d. $\sim \mu$.
Proof. Let $x^* \in \mathbb{X}$ be some arbitrary point. We truncate as follows:

$$\begin{aligned}
W_2(\mu,\nu)^2 &= \inf_{M\in\mathcal{C}(\mu,\nu)} \big\{ \mathbf{E}_M[d(X,Y)^2 1_{d(X,Y)\le a}] + \mathbf{E}_M[d(X,Y)^2 1_{d(X,Y)>a}] \big\} \\
&\le a \inf_{M\in\mathcal{C}(\mu,\nu)} \mathbf{E}_M[d(X,Y) \wedge a] + \frac{4}{a}\int d(x,x^*)^3 \{\mu(dx) + \nu(dx)\}
\end{aligned}$$

using $(b+c)^3 \le 4(b^3 + c^3)$ for $b,c \ge 0$. We claim that if $\nu_n \to \mu$ weakly, then

$$\inf_{M\in\mathcal{C}(\nu_n,\mu)} \mathbf{E}_M[d(X,Y) \wedge a] \xrightarrow{n\to\infty} 0.$$

Indeed, by the Skorokhod representation theorem, we can construct random
variables $\{X_n\}$ and $X$ on a common probability space such that $X_n \sim \nu_n$, $X \sim \mu$,
and $X_n \to X$ a.s. Thus $\mathbf{E}[d(X_n,X) \wedge a] \to 0$ by bounded convergence, and
as the joint law of $X_n, X$ is in $\mathcal{C}(\nu_n,\mu)$ the claim follows. Thus $\nu_n \to \mu$ implies
$W_2(\nu_n,\mu) \to 0$ if we can control the second term in the above truncation.

Recall that $\mu_n = \frac{1}{n}\sum_{k=1}^n \delta_{X_k}$ satisfies $\mu_n \to \mu$ weakly a.s. by the law of
large numbers. Therefore, following the above reasoning, we obtain

$$\limsup_{n\to\infty} \mathbf{E}[W_2(\mu_n,\mu)^2] \le \frac{8}{a}\int d(x,x^*)^3\,\mu(dx)$$

for every $a > 0$. Thus the result follows by letting $a \to \infty$, provided we can
show that $\int d(x,x^*)^3\,\mu(dx) < \infty$. But as $x \mapsto d(x,x^*)$ is 1-Lipschitz, this
follows readily from condition 3 of Theorem 4.31. □
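The Wasserstein law of large numbers can be observed numerically in the simplest case. The following is a minimal sketch, assuming $\mathbb{X} = \mathbb{R}$ and $\mu = N(0,1)$; it relies on the standard fact that on the real line $W_2$ is attained by the quantile coupling. The function names and the midpoint discretization (parameter `sub`) are our own illustrative choices, not part of the proof.

```python
import random
from statistics import NormalDist


def w2_empirical_vs_gaussian(samples, sub=8):
    """Approximate W_2 between the empirical measure of `samples` and N(0,1).

    On the real line, W_2^2 is the integral over u in (0,1) of the squared
    difference of the two quantile functions. The empirical quantile function
    equals the i-th order statistic on ((i-1)/n, i/n); the integral over each
    such interval is approximated by `sub` midpoints (a discretization).
    """
    xs = sorted(samples)
    n = len(xs)
    inv_cdf = NormalDist().inv_cdf
    total = 0.0
    for i, x in enumerate(xs):
        for j in range(sub):
            u = (i + (j + 0.5) / sub) / n  # always strictly inside (0, 1)
            total += (x - inv_cdf(u)) ** 2
    return (total / (n * sub)) ** 0.5


def mean_w2(n, trials, rng):
    """Monte Carlo estimate of E[W_2(mu_n, mu)] for n i.i.d. N(0,1) samples."""
    acc = 0.0
    for _ in range(trials):
        acc += w2_empirical_vs_gaussian([rng.gauss(0.0, 1.0) for _ in range(n)])
    return acc / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 100, 1000):
        print(n, mean_w2(n, 5, rng))
```

The printed estimates should decrease as $n$ grows, in line with Lemma 4.36; the sketch says nothing about the rate.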


We have now proved all the facts that were used above to establish Theorem 4.31.
The proof of Theorem 4.31 is therefore complete.


Problems 


4.10 (The Gaussian $T_2$-inequality). As we have already proved the Gaussian
concentration property using the entropy method, Theorem 4.31 implies
that the standard Gaussian distribution $N(0,1)$ on $\mathbb{R}$ must satisfy the
$T_2$-inequality. It is instructive, however, to give a direct proof of this fact. By
Theorem 4.31, this yields an alternative proof of Gaussian concentration.

Fix $X \sim \mu = N(0,1)$ and $\nu \ll \mu$. Denote their cumulative distribution
functions as $F(t) = \mathbf{P}_\mu[X \le t]$ and $G(t) = \mathbf{P}_\nu[X \le t]$, and let $\varphi := G^{-1} \circ F$.

a. Show that

$$W_2(\mu,\nu) \le \mathbf{E}[|X - \varphi(X)|^2]^{1/2}, \qquad D(\nu\|\mu) = \mathbf{E}\Big[\log \frac{d\nu}{d\mu}(\varphi(X))\Big].$$


b.  Show  that 
c. Use Gaussian integration by parts (Lemma 2.24) to show that

$$2D(\nu\|\mu) = \mathbf{E}[|X - \varphi(X)|^2] + 2\,\mathbf{E}[\varphi'(X) - 1 - \log \varphi'(X)],$$

and conclude that $N(0,1)$ satisfies the $T_2$-inequality with $\sigma = 1$.
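As a sanity check on the identity in part c, one can test it on the family $\nu = N(m, s^2)$, where everything is available in closed form. The sketch below assumes this Gaussian family (so that $\varphi(x) = m + sx$ and $\varphi'(x) = s$) and uses the standard closed-form expressions for $W_2$ and relative entropy between one-dimensional Gaussians; the function names are ours. It is an illustration, not a solution of the problem.

```python
import math


def w2_squared(m, s):
    """W_2^2 between N(0,1) and N(m, s^2).

    The monotone map is phi(x) = m + s*x, so E|X - phi(X)|^2 = m^2 + (1-s)^2.
    """
    return m * m + (1.0 - s) ** 2


def relative_entropy(m, s):
    """D(N(m, s^2) || N(0,1)) in closed form."""
    return 0.5 * (s * s + m * m - 1.0) - math.log(s)


def part_c_rhs(m, s):
    """E|X - phi(X)|^2 + 2 E[phi'(X) - 1 - log phi'(X)] with phi'(x) = s."""
    return w2_squared(m, s) + 2.0 * (s - 1.0 - math.log(s))


if __name__ == "__main__":
    for m, s in [(0.0, 1.0), (1.0, 0.5), (-2.0, 3.0)]:
        print(m, s, w2_squared(m, s), 2.0 * relative_entropy(m, s), part_c_rhs(m, s))
```

In this family the $T_2$-inequality $W_2(\mu,\nu)^2 \le 2D(\nu\|\mu)$ reduces to $\log s \le s - 1$, which is why the correction term in part c is nonnegative.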

4.11 (Stochastic calculus and the Gaussian $T_2$-inequality). The goal
of this problem is to give an alternative proof of the Gaussian $T_2$-inequality
using stochastic calculus. The method developed here can be extended to
prove the $T_2$-inequality for the laws of diffusion processes. For the purposes of
this problem, we assume the reader is already familiar with stochastic calculus.

Fix $\mu = N(0,1)$ and $\nu \ll \mu$. Let $\{W_t\}_{t\in[0,1]}$ be standard Brownian motion
under $\mathbf{P}$, and define the probability measure $d\mathbf{Q} = \frac{d\nu}{d\mu}(W_1)\,d\mathbf{P}$.

a. Show that for some nonanticipating process $\{\beta_t\}_{t\in[0,1]}$

$$\frac{d\nu}{d\mu}(W_1) = \exp\Big(\int_0^1 \beta_t\,dW_t - \frac{1}{2}\int_0^1 \beta_t^2\,dt\Big).$$

Hint: use the martingale representation theorem and Itô's formula.

b. Show that $\{Y_t\}_{t\in[0,1]}$ is Brownian motion under $\mathbf{Q}$, where

$$Y_t := W_t - \int_0^t \beta_s\,ds.$$

c. Argue that

$$W_2(\mu,\nu)^2 \le \mathbf{E}_{\mathbf{Q}}\Big[\int_0^1 \beta_t^2\,dt\Big].$$

d. Give a careful proof of the identity

$$D(\nu\|\mu) = \mathbf{E}_{\mathbf{Q}}\Big[\frac{1}{2}\int_0^1 \beta_t^2\,dt\Big].$$

Conclude that $N(0,1)$ satisfies the $T_2$-inequality with $\sigma = 1$.

4.12 (Sanov's theorem). We proved in Theorem 4.32 half of Sanov's theorem.
The other half yields a matching upper bound: if $C$ is a set of probability
measures on $\mathbb{X}$ that is compact for the weak convergence topology, then

$$\limsup_{n\to\infty} \frac{1}{n}\log \mathbf{P}\Big[\frac{1}{n}\sum_{k=1}^n \delta_{X_k} \in C\Big] \le -\inf_{\nu\in C} D(\nu\|\mu).$$

Sanov's theorem therefore shows that relative entropy controls the exact
asymptotic behavior, on a logarithmic scale, of the probability that empirical
measures take values in a (sufficiently regular) unlikely set.

While only the lower bound in Sanov's theorem is needed in the proof of
Theorem 4.31, it is instructive to prove the upper bound as well.


a. Show that for any probability measure $\nu$ and bounded function $f$

$$\frac{1}{n}\log \mathbf{P}\Big[\frac{1}{n}\sum_{k=1}^n f(X_k) > \int f\,d\nu\Big] \le \log \int e^f\,d\mu - \int f\,d\nu.$$

b. Fix $\varepsilon > 0$. Use the variational formula for entropy to show that for any
probability measure $\nu$, there is a bounded continuous function $f_\nu$ such that

$$\frac{1}{n}\log \mathbf{P}\Big[\frac{1}{n}\sum_{k=1}^n f_\nu(X_k) > \int f_\nu\,d\nu\Big] \le -D(\nu\|\mu) + \varepsilon.$$

c. Show that if $C$ is compact, then it can be covered by a finite number of
sets of the form $\{\rho : \int f_\nu\,d\rho > \int f_\nu\,d\nu\}$ with $\nu \in C$.

d. Conclude the proof of the upper bound in Sanov's theorem.

4.13 ($T_2$-inequality and log-Sobolev inequalities). We have developed
two completely different methods to obtain concentration inequalities: the
entropy method and the transportation method. The goal of this problem is
to develop some connections between the two.

a. Suppose that a probability $\mu$ on $\mathbb{R}^d$ satisfies the log-Sobolev inequality

$$\mathrm{Ent}_\mu[e^f] \le \frac{\sigma^2}{2}\,\mathbf{E}_\mu[\|\nabla f\|^2 e^f] \quad\text{for all } f.$$

Show that this implies that $\mu$ also satisfies the $T_2$-inequality.

By Theorem 4.31, the $T_2$-inequality is equivalent to dimension-free Euclidean
concentration. We have just shown that the log-Sobolev inequality implies
the $T_2$-inequality. One might hope that the converse is also true, that is,
that $T_2$ implies log-Sobolev for probability measures on $\mathbb{R}^d$. This proves to
be false, however: log-Sobolev is strictly stronger than $T_2$. It is possible to
provide an explicit example that satisfies $T_2$ but not log-Sobolev (e.g., $\mu(dx) \propto
e^{-|x|^3 - |x|^{9/4} - 3x^2\sin^2 x}\,dx$ on $\mathbb{R}$), but we omit the tedious verification of this fact.

Remarkably, however, it is easy to show that if $\mu$ satisfies the $T_2$-inequality,
then it also satisfies the log-Sobolev inequality for convex functions. Moreover,
for concave functions, the log-Sobolev inequality can even be improved!

a. Show that for any measure $\mu$ and function $f$,

$$\frac{\mathrm{Ent}_\mu[e^f]}{\mathbf{E}_\mu[e^f]} \le \int f\,d\nu - \int f\,d\mu \quad\text{with}\quad d\nu = \frac{e^f}{\mathbf{E}_\mu[e^f]}\,d\mu.$$

b. Show that

$$\frac{\mathrm{Ent}_\mu[e^f]}{\mathbf{E}_\mu[e^f]} \le \inf_{M\in\mathcal{C}(\nu,\mu)} \mathbf{E}_M[\nabla f(X)\cdot(X - Y)] \quad\text{for convex } f,$$

$$\frac{\mathrm{Ent}_\mu[e^f]}{\mathbf{E}_\mu[e^f]} \le \inf_{M\in\mathcal{C}(\nu,\mu)} \mathbf{E}_M[\nabla f(Y)\cdot(X - Y)] \quad\text{for concave } f.$$



c. Conclude that if $\mu$ satisfies the $T_2$-inequality, then

$$\mathrm{Ent}_\mu[e^f] \le 2\sigma^2\,\mathbf{E}_\mu[\|\nabla f\|^2 e^f] \quad\text{for convex } f,$$

$$\mathrm{Ent}_\mu[e^f] \le 2\sigma^2\,\mathbf{E}_\mu[\|\nabla f\|^2]\,\mathbf{E}_\mu[e^f] \quad\text{for concave } f.$$

d. Deduce a version of the Gaussian concentration property (Theorem 3.25)
for concave functions with improved variance proxy.

4.14 (Inf-convolution inequalities). The goal of this problem is to develop
an alternative formulation of the $T_2$-inequality that is particularly useful for
analysis of probability measures on $\mathbb{R}^d$. Before we state this alternative
formulation, we must develop an analogue of Monge-Kantorovich duality for $W_2$.

a. Let $(\mathbb{X}, d)$ be a separable metric space. Show that

$$W_2(\mu,\nu)^2 = \sup_{g(x) - f(y) \le d(x,y)^2} \{\mathbf{E}_\nu g - \mathbf{E}_\mu f\}.$$

Hint: emulate the proof of Theorem 4.13 and Problem 4.3.

For any function $f$, define the inf-convolution

$$Q_t f(x) := \inf_{y\in\mathbb{X}} \Big\{ f(y) + \frac{1}{2t}\,d(x,y)^2 \Big\}.$$

We will show that for any probability $\mu$ on a separable metric space $(\mathbb{X}, d)$,

$$W_2(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu\|\mu)} \text{ for all } \nu \quad\text{iff}\quad \mathbf{E}_\mu[e^{Q_{\sigma^2} f - \mathbf{E}_\mu f}] \le 1 \text{ for all } f.$$

The latter inequality is called an inf-convolution inequality.

b. Prove the equivalence between the $T_2$ and inf-convolution inequalities.
Hint: emulate the proof of Theorem 4.8.

Let $\mu$ be a probability measure on $\mathbb{R}^d$ that satisfies the $T_2$-inequality. We have
seen above that this does not necessarily imply that $\mu$ satisfies a log-Sobolev
inequality. However, we will presently show that $\mu$ must at least satisfy a
Poincaré inequality whenever the $T_2$-inequality holds.

c. Given any sufficiently smooth function $f : \mathbb{R}^d \to \mathbb{R}$, show that the function
$v(t,x) = Q_t f(x)$ is the (Hopf-Lax) solution of the Hamilton-Jacobi equation

$$\frac{\partial v}{\partial t} + \frac{1}{2}\|\nabla v\|^2 = 0, \qquad v(0,\cdot) = f.$$

d. Show that if a probability $\mu$ on $\mathbb{R}^d$ satisfies the $T_2$-inequality, then

$$\mathrm{Var}_\mu[f] \le \sigma^2\,\mathbf{E}_\mu[\|\nabla f\|^2] \quad\text{for all } f.$$

Hint: apply the inf-convolution inequality to $tf$ and expand around $t = 0$.



Notes 

§4.1. Historically, the metric approach to concentration was the first to be
developed. The formulation in terms of Lipschitz functions dates back to the
first proof of the Gaussian concentration property due to Tsirelson, Ibragimov,
and Sudakov [90] using stochastic calculus, while the fundamental importance
of Lipschitz concentration and its connection with isoperimetric problems
(Problem 4.2) was emphasized and systematically exploited by Milman in the
context of Banach space theory [62]. A comprehensive treatment of these
ideas can be found in [50]. Theorem 4.8 is due to [11]. The Gibbs variational
principle dates back to the inception of statistical mechanics [39, Theorem III,
p. 131]. Pinsker's inequality is a basic fact in information theory [20].

§4.2. The texts by Villani [97, 98] are a fantastic source on optimal
transportation problems and their connections with other areas of mathematics.
An elementary introduction to linear programming duality is given in [36]
(in fact, linear programming duality was invented by Kantorovich in order
to prove Theorem 4.13, see [94] for historical comments). The continuous
extension in Problem 4.3 was inspired by the treatment in [31]. The optimal
coupling for the trivial metric was constructed in [25].

The transportation method for proving concentration inequalities is due
to Marton [54]. Both the tensorization method and Problem 4.5 are from [54].
The general formulation of Theorem 4.15 given here was taken from [13].

§4.3. Talagrand's concentration inequality was developed in [76, 80] in an
isoperimetric form in terms of a "convex distance" from a point to a set (an
entire family of related inequalities is obtained there as well). A detailed
exposition of these results can be found in [84, 50]. It was realized by Marton [55]
that Talagrand's inequality can be proved using the transportation method
using the asymmetric "distance" $d_2$, and the proof we give is due to her (with
a simplified proof for $n = 1$ due to Samson [71]). The more general inequalities
from [80] can also be recovered by the transportation method [21]. Problems
4.7 and 4.8 were inspired by the presentation in [26]. Problem 4.9 is from [76].

It  is  also  possible  to  prove  Talagrand’s  concentration  inequality  indirectly 
(through  its  isoperimetric  form)  using  log-Sobolev  methods;  see  [13]. 

§4.4. That the $T_2$-inequality suffices for dimension-free Euclidean transportation
was noted by Talagrand [85]. Problem 4.10 follows the proof in [85] that
the Gaussian measure satisfies the $T_2$-inequality. The stochastic calculus proof
of Problem 4.11 is taken from [24]. Theorem 4.31 is due to Gozlan [41]. Sanov's
theorem is a classical result in large deviations theory [22]; the proof given
here was taken from lecture notes by Varadhan. Problem 4.13 is from [71].
The connection between concentration and inf-convolutions is due to Maurey
[57]; Problem 4.14 follows the presentation in [50].
We have shown in the previous chapters that in many cases a function
$f(X_1,\ldots,X_n)$ of i.i.d. random variables is close to its mean $\mathbf{E}[f(X_1,\ldots,X_n)]$.
The concentration phenomenon says nothing, however, about the magnitude
of the mean $\mathbf{E}[f(X_1,\ldots,X_n)]$ itself. One cannot hope to address such questions
at the same level of generality as we investigated concentration: some
additional structure is needed in order to develop any meaningful theory.

The type of structure that will be investigated in the sequel are suprema

$$F = \sup_{t\in T} X_t,$$

where $\{X_t\}_{t\in T}$ is a random process that is defined on some index set $T$. Such
problems arise in numerous high-dimensional applications, such as random
matrix theory and probability in Banach spaces, control of empirical processes
in statistics and machine learning, random optimization problems, etc. It is
typically the case that the distribution of individual $X_t$ is well understood, so
that the main difficulty lies in understanding the effect of the supremum. To
this end, we formulated in Chapter 1 the following informal principle:

If the random process $\{X_t\}_{t\in T}$ is "sufficiently continuous," the magnitude of
$\sup_{t\in T} X_t$ is controlled by the "complexity" of the index set $T$.

In the sequel, we proceed to make this informal idea precise.


5.1  Finite  maxima 

Before we can develop a general theory to control suprema of random processes,
we must understand the simplest possible situation: the maximum of
a finite number of random variables, that is, the case where the index set $T$
has finite cardinality $|T| < \infty$. In fact, this special case will form the most
basic ingredient of our theory. To develop a more general theory, the fundamental
idea in the sequel will be to approximate the supremum over a general
index set by the maximum over a finite set in increasingly sophisticated ways.
By appropriately combining these two basic ingredients (finite maxima and
approximation), we will develop powerful tools that yield remarkably sharp
control over the suprema of many random processes.

How can one bound the maximum of a finite number of random variables?
The most naive approach imaginable is to bound the supremum by a sum:

$$\sup_{t\in T} X_t \le \sum_{t\in T} |X_t|.$$

Plugging this trivial fact into an expectation, we obtain

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le |T| \sup_{t\in T} \mathbf{E}|X_t|.$$

Thus if we can control the magnitude of every random variable $X_t$ individually,
then we obtain a bound that grows linearly in the cardinality $|T|$.

Of course, bounding a maximum by a sum is an exceedingly crude idea,
and it seems unlikely a priori that one could draw any remotely accurate
conclusions from such a procedure. Nonetheless, this simple idea is not as bad
as it may appear at first sight if we use it a bit more carefully. Suppose, for
example, that the random variables $X_t$ have bounded $p$th moment. Then

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \mathbf{E}\Big[\sup_{t\in T} |X_t|^p\Big]^{1/p} \le |T|^{1/p} \sup_{t\in T} \mathbf{E}[|X_t|^p]^{1/p},$$

where we have bounded the maximum by a sum after applying Jensen's
inequality. This has significantly improved the dependence on the cardinality
from $|T|$ to $|T|^{1/p}$. Evidently our control of the maximum of random variables
is closely related to the tail behavior of these random variables: the thinner
the tails (i.e., the larger $p$), the better we can control their maximum. Once
this idea has been understood, however, there is no need to stop at moments:
if the random variables $X_t$ possess a finite moment generating function, we
can apply an exponential transformation precisely as in the development of
Chernoff bounds in section 3.1 to estimate the maximum.


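Lemma 5.1 (Maximal inequality). Suppose that $\log \mathbf{E}[e^{\lambda X_t}] \le \psi(\lambda)$ for
all $\lambda \ge 0$ and $t \in T$, where $\psi$ is convex and $\psi(0) = \psi'(0) = 0$. Then

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \psi^{*-1}(\log |T|),$$

where $\psi^*(x) = \sup_{\lambda\ge 0}\{\lambda x - \psi(\lambda)\}$ denotes the Legendre dual of the function
$\psi$. In particular, if $X_t$ is $\sigma^2$-subgaussian for every $t \in T$, we have

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \sqrt{2\sigma^2 \log |T|}.$$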
Proof. By Jensen's inequality, we have for any $\lambda > 0$

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \frac{1}{\lambda}\log \mathbf{E}[e^{\lambda \sup_{t\in T} X_t}] \le \frac{1}{\lambda}\log \sum_{t\in T} \mathbf{E}[e^{\lambda X_t}] \le \frac{\log |T| + \psi(\lambda)}{\lambda}.$$

As $\lambda > 0$ is arbitrary, we can now optimize over $\lambda$ on the right hand side. In
the special case that $X_t$ is $\sigma^2$-subgaussian (so that $\psi(\lambda) = \lambda^2\sigma^2/2$), we obtain

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \inf_{\lambda>0} \Big\{ \frac{\log |T|}{\lambda} + \frac{\sigma^2\lambda}{2} \Big\} = \sqrt{2\sigma^2 \log |T|}.$$

In the general case, the only difficulty is to evaluate the infimum in

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \inf_{\lambda>0} \frac{\log |T| + \psi(\lambda)}{\lambda}.$$

Suppose $\psi^*$ is invertible. Note that $\{\psi^*(z) + \psi(\lambda)\}/\lambda \ge z$ for all $\lambda > 0$ by the
definition of $\psi^*$, and that equality is attained if we choose $\lambda$ to be the
optimizer in the definition of $\psi^*$. Setting $\psi^*(z) = \log |T|$ yields the conclusion.

It remains to show that $\psi^*$ is invertible. As $\psi^*$ is the supremum of
linear functions, $x \mapsto \psi^*(x)$ is convex and strictly increasing except at those
values $x$ where the maximum in the definition of $\psi^*$ is attained at $\lambda = 0$,
that is, when $\lambda x - \psi(\lambda) \le -\psi(0)$ for all $\lambda \ge 0$. By the first-order condition
for convexity, the latter occurs if and only if $x \le \psi'(0) = 0$. Moreover, as
$\psi^*(0) = 0$, we conclude that $x \mapsto \psi^*(x)$ is convex, strictly increasing, and
nonnegative for $x \ge 0$. Thus the inverse $\psi^{*-1}(x)$ is well defined for $x \ge 0$. □
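To get a feeling for Lemma 5.1, the optimization over $\lambda$ and the resulting bound can be checked numerically. The following is a minimal sketch under our own assumptions: the variables are i.i.d. standard Gaussians (so $\sigma = 1$), the infimum is approximated on a finite grid of $\lambda$ values, and the expectation is estimated by Monte Carlo. The function names are illustrative.

```python
import math
import random


def maximal_bound(psi, n, lambdas):
    """Minimize (log n + psi(lam)) / lam over a finite grid of lam > 0.

    The grid minimum can only overestimate the true infimum over lam > 0.
    """
    return min((math.log(n) + psi(lam)) / lam for lam in lambdas)


def mean_max_gaussian(n, trials, rng):
    """Monte Carlo estimate of E[max of n i.i.d. N(0,1) variables]."""
    acc = 0.0
    for _ in range(trials):
        acc += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return acc / trials


if __name__ == "__main__":
    n = 1000
    grid = [0.01 * k for k in range(1, 2001)]
    numeric = maximal_bound(lambda lam: lam * lam / 2.0, n, grid)
    closed_form = math.sqrt(2.0 * math.log(n))
    estimate = mean_max_gaussian(n, 200, random.Random(0))
    print(numeric, closed_form, estimate)
```

The grid minimum should agree with $\sqrt{2\log n}$ up to discretization error, and the simulated mean of the maximum should fall below it, as the lemma requires.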


Lemma 5.1 should be viewed as an analogue of the Chernoff bound of
Lemma 3.1 in the setting of maxima of random variables. Recall that the
Chernoff bound states that if $\log \mathbf{E}[e^{\lambda X_t}] \le \psi(\lambda)$ for all $\lambda \ge 0$ and $t \in T$, then

$$\mathbf{P}[X_t \ge x] \le e^{-\psi^*(x)} \quad\text{for all } x \ge 0,\ t \in T.$$

Thus our bound on the magnitude of the maximum depends on $|T|$ as the
inverse of the tail probability of the individual random variables (as the inverse
of the function $e^{\psi^*(x)}$ is $\psi^{*-1}(\log x)$). This is not a coincidence. In fact, we
can use the Chernoff bound directly to estimate the tail probabilities of the
maximum (rather than the expectation as in Lemma 5.1) as follows.

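Lemma 5.2 (Maximal tail inequality). Suppose that $\log \mathbf{E}[e^{\lambda X_t}] \le \psi(\lambda)$
for all $\lambda \ge 0$ and $t \in T$, where $\psi$ is convex and $\psi(0) = \psi'(0) = 0$. Then

$$\mathbf{P}\Big[\sup_{t\in T} X_t \ge \psi^{*-1}(\log |T| + u)\Big] \le e^{-u} \quad\text{for all } u \ge 0.$$

In particular, if $X_t$ is $\sigma^2$-subgaussian for every $t \in T$, we have

$$\mathbf{P}\Big[\sup_{t\in T} X_t \ge \sqrt{2\sigma^2 \log |T|} + x\Big] \le e^{-x^2/2\sigma^2} \quad\text{for all } x \ge 0.$$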
Proof. We readily estimate using the Chernoff bound

$$\mathbf{P}\Big[\sup_{t\in T} X_t \ge x\Big] = \mathbf{P}\Big[\bigcup_{t\in T} \{X_t \ge x\}\Big] \le \sum_{t\in T} \mathbf{P}[X_t \ge x] \le e^{\log |T| - \psi^*(x)}.$$

Writing $u = \psi^*(x) - \log |T|$ yields the first inequality (the invertibility of $\psi^*$
was shown in the proof of Lemma 5.1). In the subgaussian case,

$$\psi^{*-1}(\log |T| + u) = \sqrt{2\sigma^2(\log |T| + u)} \le \sqrt{2\sigma^2 \log |T|} + \sqrt{2\sigma^2 u}$$

yields the second inequality. □

The argument used in the proof of Lemma 5.2 is called a union bound:
we have estimated the probability of a union of events by the sum of the
probabilities $\mathbf{P}[A \cup B] \le \mathbf{P}[A] + \mathbf{P}[B]$. This crude estimate plays exactly
the same role in the proof of Lemma 5.2 as does bounding the maximum of
random variables by their sum in the proof of Lemma 5.1.

Remark 5.3. While this may not be evident at the outset, the proofs of Lemmas
5.1 and 5.2 are based on precisely the same idea. Indeed, the union bound
is merely another example of bounding a maximum by a sum:

$$\mathbf{P}[A_1 \cup \cdots \cup A_n] = \mathbf{E}[\max\{1_{A_1},\ldots,1_{A_n}\}] \le \mathbf{E}[1_{A_1}] + \cdots + \mathbf{E}[1_{A_n}].$$

Lemmas 5.1 and 5.2 are therefore ultimately implementing the same bound
in a slightly different way. In fact, it is not difficult to deduce a form of Lemma
5.1 with a slightly worse constant directly from Lemma 5.2 by integrating the
tail bound, that is, using $\mathbf{E}[Z] = \int_0^\infty \mathbf{P}[Z \ge z]\,dz$ for $Z \ge 0$.

We have obtained above some simple bounds on the maximum of a finite
number of random variables. How good are these bounds? There are several
reasons to be suspicious. On the one hand, we have obtained our estimates in
an exceedingly crude fashion by bounding a maximum by a sum. On the other
hand, while we made assumptions about the tail behavior of the individual
variables $X_t$, we made no assumptions of any kind about the joint distribution
of $\{X_t\}_{t\in T}$. One would expect dependencies between the random
variables $X_t$ to make a significant difference to their maximum. As an
extreme example, suppose $\{X_t\}_{t\in T}$ are completely dependent in the sense that
$X_t = X_s$ for all $t,s \in T$. Then $\mathbf{E}[\sup_t X_t] = \mathbf{E}[X_s]$ does not depend on $|T|$ at
all, whereas the bound in Lemma 5.1 necessarily grows with $|T|$. Of course,
there is no contradiction: Lemma 5.1 is correct, but is evidently far from sharp
in the presence of strong dependence between the random variables $X_t$.

Remarkably, however, Lemmas 5.1 and 5.2 prove to be essentially sharp
when the random variables $\{X_t\}_{t\in T}$ are independent. It is perhaps surprising
that a method as crude as bounding a maximum by a sum would lead to a
sharp result in any nontrivial situation. However, it turns out that this idea is
not as bad as may be expected at first sight in the presence of independence.
For example, consider the union bound $\mathbf{P}[A \cup B] \le \mathbf{P}[A] + \mathbf{P}[B]$. Equality holds
when $A$ and $B$ are disjoint, but this is certainly not the case in the proof of
Lemma 5.2. Nonetheless, when $A$ and $B$ are independent, the probability that
they occur simultaneously is much smaller than the individual probabilities,
so that we still have $\mathbf{P}[A \cup B] \approx \mathbf{P}[A] + \mathbf{P}[B]$. This idea will be exploited in
Problem 5.1 below to show that Lemmas 5.1 and 5.2 are essentially sharp in
the independent case. When viewed in terms of a sum of random variables,
we see that in this setting the sum is dominated by its largest term, so that
approximating the maximum by a sum is not such a bad idea after all.
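The claim that the union bound loses at most a constant factor under independence is easy to test numerically. The sketch below assumes independent events with given probabilities $p_1,\ldots,p_n$, so that $\mathbf{P}[\bigcup_k A_k] = 1 - \prod_k (1 - p_k)$ exactly, and compares this with the two-sided bound $(1-e^{-1})\{1 \wedge \sum_k p_k\} \le \mathbf{P}[\bigcup_k A_k] \le 1 \wedge \sum_k p_k$ that is the subject of Problem 5.1. The function names are ours.

```python
import math
import random


def union_probability(ps):
    """Exact probability of a union of independent events with probabilities ps."""
    none_occur = 1.0
    for p in ps:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def union_bounds(ps):
    """Lower and upper bounds (1 - 1/e) * min(1, sum p) and min(1, sum p)."""
    capped_sum = min(1.0, sum(ps))
    return (1.0 - math.exp(-1.0)) * capped_sum, capped_sum


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (2, 10, 100):
        ps = [rng.random() / n for _ in range(n)]
        lower, upper = union_bounds(ps)
        print(n, lower, union_probability(ps), upper)
```

The middle column always lies between the outer two, so for independent events the union bound is sharp up to the factor $1 - e^{-1}$.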


Problems 


5.1 (Maxima of independent random variables). The proofs of the maximal
inequalities in the present section rely on a very crude device: bounding
the maximum of random variables by a sum. Nonetheless, when the random
variables are independent, the bounds we obtain above are often sharp. To
understand why, we must prove lower bounds of the same order.

It is easiest to consider first the setting of Lemma 5.2. Let us begin by
proving matching upper and lower union bounds for independent events.

a. Show that if $A_1,\ldots,A_n$ are independent events, then

$$(1 - e^{-1})\Big\{1 \wedge \sum_{k=1}^n \mathbf{P}[A_k]\Big\} \le \mathbf{P}\Big[\bigcup_{k=1}^n A_k\Big] \le 1 \wedge \sum_{k=1}^n \mathbf{P}[A_k].$$

Hint: $\prod_{k=1}^n \{1 - x_k\} \le \exp(-\sum_{k=1}^n x_k)$ and $1 - e^{-x} \ge (1 - e^{-1})\{1 \wedge x\}$.


b. Let $\eta^*$ be a strictly increasing convex function, and suppose that

$$\mathbf{P}[X_t \ge x] \ge e^{-\eta^*(x)} \quad\text{for all } x \ge 0,\ t \in T.$$

Conclude that for $u \ge 0$

$$\mathbf{P}\Big[\sup_{t\in T} X_t \ge \eta^{*-1}(\log |T| + u)\Big] \ge (1 - e^{-1})\,e^{-u},$$

and compare with the corresponding upper bound in Lemma 5.2.

Now that we have obtained a lower bound on the tail probability of the maximum
(corresponding to the upper bound of Lemma 5.2), we can obtain a
lower bound on the expectation of the maximum (corresponding to the upper
bound of Lemma 5.1) by integrating the tail bound.

c. Deduce from the previous part that for $x \ge 0$

$$\mathbf{P}\Big[\sup_{t\in T} X_t \ge \eta^{*-1}(2\log |T|)/2 + x\Big] \ge (1 - e^{-1})\,e^{-\eta^*(2x)/2}.$$

Hint: use concavity of $\eta^{*-1}$.
d. Conclude that if

$$e^{-\eta^*(x)} \le \mathbf{P}[X_t \ge x] \le e^{-\psi^*(x)} \quad\text{for all } x \ge 0,\ t \in T,$$

then we have

$$\frac{1 - e^{-1}}{2}\,\eta^{*-1}(2\log |T|) + \sup_{t\in T} \mathbf{E}[0 \wedge X_t] \le \mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \psi^{*-1}(\log |T|).$$

Hint: use $\mathbf{E}[0 \vee Z] = \int_0^\infty \mathbf{P}[Z \ge x]\,dx$.

The upper and lower bounds in the previous part are generally of the same
order, provided that we start with upper and lower bounds on $\mathbf{P}[X_t \ge x]$ of
the same order. For example, let us consider the case of Gaussian variables.

e. For $X \sim N(0,1)$, show that

$$\mathbf{P}[X \ge x] \ge \frac{e^{-x^2}}{2\sqrt{2}} \quad\text{for all } x \ge 0.$$

Hint: write the probability as an integral and use $(v + x)^2 \le 2v^2 + 2x^2$.

f. Let $X_1,\ldots,X_n$ be i.i.d. Gaussian random variables with zero mean and
unit variance. Show that the above bound implies

$$\frac{1 - e^{-1}}{2}\sqrt{2\log(n 2^{-3/4})} - \frac{1}{\sqrt{2\pi}} \le \mathbf{E}\Big[\max_{i\le n} X_i\Big] \le \sqrt{2\log n}.$$

In particular, $c\sqrt{\log n} \le \mathbf{E}[\max_{i\le n} X_i] \le C\sqrt{\log n}$ for $n$ sufficiently large.

g. If $X_1, X_2, \ldots$ are i.i.d. Gaussian, prove the asymptotic

$$\frac{\max_{i\le n} X_i}{\sqrt{2\log n}} \xrightarrow{n\to\infty} 1 \quad\text{in probability.}$$

Hint: for the upper bound, see Problem 3.5. For the lower bound, proceed
analogously using a suitable improvement on the Gaussian tail lower bound
obtained above (use $(v + x)^2 \le (1 + \varepsilon^{-1})v^2 + (1 + \varepsilon)x^2$).

5.2 (Approximating a maximum by a sum). Show that for $\lambda > 0$

$$\max_{t\in T} X_t \le \frac{1}{\lambda}\log \sum_{t\in T} e^{\lambda X_t} \le \max_{t\in T} X_t + \frac{\log |T|}{\lambda}.$$

Thus when $\lambda$ is large, the sum is increasingly dominated by its largest term.
This simple observation is often useful in problems where a smooth
approximation of the maximum function $x \mapsto \max_i x_i$ is needed.
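The two-sided bound of Problem 5.2 is easy to observe numerically. The following minimal sketch evaluates the smooth approximation $\frac{1}{\lambda}\log\sum_t e^{\lambda x_t}$ for a fixed vector; subtracting the maximum before exponentiating is a standard numerical safeguard against overflow and is our own addition, as is the function name.

```python
import math


def soft_max(xs, lam):
    """Compute (1/lam) * log(sum_i exp(lam * x_i)) for lam > 0.

    The maximum is factored out first so that every exponent is <= 0,
    which avoids overflow for large lam without changing the value.
    """
    m = max(xs)
    return m + math.log(sum(math.exp(lam * (x - m)) for x in xs)) / lam


if __name__ == "__main__":
    xs = [0.3, -1.2, 2.5, 2.4]
    for lam in (0.1, 1.0, 10.0, 100.0):
        print(lam, max(xs), soft_max(xs, lam), max(xs) + math.log(len(xs)) / lam)
```

As $\lambda$ increases the middle value is squeezed toward $\max_i x_i$, at the cost of a less smooth function.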

5.3 (Johnson-Lindenstrauss lemma). The following functional analysis
result has found many applications in computer science and signal processing.

Let $x_1,\ldots,x_n$ be points in a Hilbert space $H$. Then for every $0 < \varepsilon < 1$
and $k \gtrsim \varepsilon^{-2}\log n$, there exists a linear map $T : H \to \mathbb{R}^k$ such that

$$(1 - \varepsilon)\|x_i - x_j\| \le \|Tx_i - Tx_j\| \le (1 + \varepsilon)\|x_i - x_j\| \quad\text{for all } 1 \le i,j \le n.$$

This result should be interpreted in terms of compression: if we want to store
the distances between $n$ points in a data structure, and if we tolerate a small
distortion of order $\varepsilon$, it suffices to store an $n \times k$ matrix of size $\sim n\log n$ rather
than the full $n \times n$ distance matrix of size $\sim n^2$.

At first sight, the Johnson-Lindenstrauss lemma has nothing to do with
probability: it is a deterministic statement about the geometry of Hilbert
spaces. However, the easiest way to find $T$ is to select it randomly!

a. Argue that we can assume without loss of generality that $H = \mathbb{R}^n$.

b. For a $k \times n$ random matrix $T$ such that $T_{ij}$ are i.i.d. $N(0, k^{-1})$, show that

$$\mathbf{P}\big[\big|\|Tz\| - \mathbf{E}\|Tz\|\big| \ge \varepsilon\|z\|\big] \le 2e^{-k\varepsilon^2/2} \quad\text{for } z \in \mathbb{R}^n.$$

Hint: Gaussian concentration.

c. Show that

$$\sqrt{1 - k^{-1}}\,\|z\| \le \mathbf{E}\|Tz\| \le \|z\|,$$

and  conclude  that  for  0  <  e  <  1  and  k  > 

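$$\mathbf{P}[(1 - \varepsilon)\|z\| \le \|Tz\| \le (1 + \varepsilon)\|z\|] \ge 1 - 2e^{-k\varepsilon^2/8} \quad\text{for } z \in \mathbb{R}^n.$$

Hint: Use $\mathbf{E}\|Tz\| \le \mathbf{E}[\|Tz\|^2]^{1/2}$ for the upper bound. For the lower bound,
estimate $\mathrm{Var}\|Tz\|$ from above using the Gaussian Poincaré inequality.

d. Show that if $k > 24\varepsilon^{-2}\log n$, then

$$\mathbf{P}[(1 - \varepsilon)\|x_i - x_j\| \le \|Tx_i - Tx_j\| \le (1 + \varepsilon)\|x_i - x_j\| \text{ for all } i,j] > 0.$$

Hint: use a union bound.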
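The random construction in Problem 5.3 can be tried out directly. The sketch below is a small simulation under our own assumptions: the points are random vectors in $\mathbb{R}^d$ with an illustrative ambient dimension $d$, the matrix has i.i.d. $N(0, k^{-1})$ entries, and $k$ is chosen as in part d. All function names are hypothetical, and pure Python lists are used only to keep the example self-contained.

```python
import math
import random


def jl_dimension(n, eps):
    """Target dimension k = ceil(24 * eps^-2 * log n), as in part d."""
    return math.ceil(24.0 * math.log(n) / (eps * eps))


def random_projection(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries (standard deviation 1/sqrt(k))."""
    sd = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, sd) for _ in range(d)] for _ in range(k)]


def apply(T, x):
    """Matrix-vector product T x."""
    return [sum(t * xi for t, xi in zip(row, x)) for row in T]


def dist(x, y):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distortion_range(points, T):
    """Smallest and largest ratio ||Tx_i - Tx_j|| / ||x_i - x_j|| over pairs i < j."""
    images = [apply(T, p) for p in points]
    ratios = [
        dist(images[i], images[j]) / dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    return min(ratios), max(ratios)


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, eps = 20, 100, 0.5
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    T = random_projection(jl_dimension(n, eps), d, rng)
    print(distortion_range(points, T))
```

With these small parameters $k$ exceeds $d$, so nothing is compressed; the point of the sketch is only that all pairwise distance ratios land in $[1-\varepsilon, 1+\varepsilon]$, typically with a lot of room to spare.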


5.2  Covering,  packing,  and  approximation 

If the set $T$ is infinite, the maximal inequalities of the previous section provide
no information. This is, however, not surprising. We have seen that the
inequalities for finite maxima work well when the random variables are independent.
On the other hand, suppose that $T$ is infinite but that $t \mapsto X_t$ is
continuous in a suitable sense. Then $\lim_{t\to s} X_t = X_s$, so $X_t$ and $X_s$ must be
strongly dependent when $t$ and $s$ are nearby points! Thus the lack of independence
should in fact help us to control the infinite supremum: we should
apply the maximal inequalities of the previous section only to a finite number
of well-separated points (at which the process might be expected to be nearly
independent), and use continuity to control the fluctuations of the remaining
(strongly dependent) degrees of freedom. In this section, we will develop the
crudest illustration of this principle, which will be systematically developed
in the sequel into a powerful machinery to control suprema.

To  implement  the  above  idea,  we  need  to  have  a  quantitative  notion  of 
continuity.  In  this  section,  we  will  use  the  simplest  (but,  as  we  will  see,  often 
unsatisfactory)  such  notion  for  random  processes. 

Definition 5.4 (Lipschitz process). The random process $\{X_t\}_{t \in T}$ is called
Lipschitz for a metric $d$ on $T$ if there exists a random variable $C$ such that

$$|X_t - X_s| \le C\,d(t,s) \quad\text{for all } t,s \in T.$$

Given  a  Lipschitz  process,  our  aim  is  to  approximate  the  supremum  over 
T  by  the  maximum  over  a  finite  set  N,  to  which  we  will  apply  the  inequalities 
of  the  previous  section.  To  obtain  a  good  bound,  we  have  two  competing 
demands:  on  the  one  hand,  we  would  like  the  set  N  to  be  as  small  as  possible 
(so  that  the  bound  on  the  maximum  is  small);  on  the  other  hand,  to  control 
the  approximation  error,  we  must  make  sure  that  every  point  in  T  is  close  to 
at  least  one  of  the  points  in  N.  This  leads  to  the  following  concept. 

Definition 5.5 ($\varepsilon$-net and covering number). A set $N$ is called an $\varepsilon$-net
for $(T,d)$ if for every $t \in T$, there exists $\pi(t) \in N$ such that $d(t,\pi(t)) \le \varepsilon$.
The smallest cardinality of an $\varepsilon$-net for $(T,d)$ is called the covering number

$$N(T,d,\varepsilon) := \inf\{|N| : N \text{ is an } \varepsilon\text{-net for } (T,d)\}.$$

The covering number $N(T,d,\varepsilon)$ should be viewed as a measure of the
complexity of the set $T$ at the scale $\varepsilon$: the more complex $T$, the more points
we will need to approximate its structure up to a fixed precision. Alternatively,
we can interpret the covering number as describing the geometry of the metric
space $(T,d)$. Indeed, let $B(t,\varepsilon) = \{s : d(t,s) \le \varepsilon\}$ be a ball of radius $\varepsilon$. Then

$$N \text{ is an } \varepsilon\text{-net if and only if } T \subseteq \bigcup_{t \in N} B(t,\varepsilon),$$

so that the covering number $N(T,d,\varepsilon)$ is the smallest number of balls of radius
$\varepsilon$ needed to cover $T$ (hence the name). We can therefore interpret the covering
number as a measure of the degree of (non-)compactness of $(T,d)$.

Remark 5.6. In many applications, we may want to compute the supremum
$\sup_{t \in T} X_t$ of a stochastic process $\{X_t\}_{t \in S}$ that is defined on a larger index
set $S \supseteq T$. In this case, even though we are only interested in the process on
the set $T$, it is not necessary to require that the $\varepsilon$-net $N$ is a subset of $T$: it
can be convenient to approximate the set $T$ by points in $S \setminus T$ also. For this
reason, we have not insisted in the above definition that $N \subseteq T$.

We are now ready to develop our first bound on the supremum of a random
process. We adopt the notation of Definitions 5.4 and 5.5.

Lemma 5.7 (Lipschitz maximal inequality). Suppose $\{X_t\}_{t \in T}$ is a Lipschitz
process (Definition 5.4) and $X_t$ is $\sigma^2$-subgaussian for every $t \in T$. Then

$$\mathbf{E}\Big[\sup_{t \in T} X_t\Big] \le \inf_{\varepsilon > 0}\Big\{\varepsilon\,\mathbf{E}[C] + \sqrt{2\sigma^2 \log N(T,d,\varepsilon)}\Big\}.$$

Note that this result is indeed a simple incarnation of the informal principle
formulated in Chapter 1: if the process $X_t$ is “sufficiently continuous,” then
$\sup_{t \in T} X_t$ is controlled by the “complexity” of the index set $T$.

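Proof. Let $\varepsilon > 0$ and let $N$ be an $\varepsilon$-net. Then

$$\sup_{t \in T} X_t \le \sup_{t \in T}\{X_t - X_{\pi(t)}\} + \sup_{t \in T} X_{\pi(t)} \le C\varepsilon + \max_{t \in N} X_t.$$

Taking the expectation and using Lemma 5.1 yields

$$\mathbf{E}\Big[\sup_{t \in T} X_t\Big] \le \varepsilon\,\mathbf{E}[C] + \sqrt{2\sigma^2 \log |N|}.$$

Optimizing over $\varepsilon$-nets $N$ and $\varepsilon > 0$ yields the result.

□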
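
To get a feeling for the trade-off in Lemma 5.7, one can evaluate the right-hand side numerically. The sketch below does this for the toy index set $T = [0,1]$ with $d(t,s) = |t-s|$, where $N(T,d,\varepsilon) = \lceil 1/(2\varepsilon)\rceil$ since closed intervals of length $2\varepsilon$ are needed to cover $[0,1]$. The values $\mathbf{E}[C] = 10$ and $\sigma^2 = 1$, the grid of scales, and all function names are hypothetical choices made for illustration.

```python
import math


def lipschitz_maximal_bound(expected_c, sigma2, covering_number, eps_grid):
    """Minimise eps*E[C] + sqrt(2*sigma2*log N(eps)) over a finite grid of eps.

    Restricting the infimum to a grid can only increase its value, so the
    result is still an upper bound of the form appearing in Lemma 5.7.
    """
    best = math.inf
    for eps in eps_grid:
        log_cover = math.log(covering_number(eps))
        best = min(best, eps * expected_c + math.sqrt(2.0 * sigma2 * log_cover))
    return best


def covering_number_unit_interval(eps):
    """N([0,1], |.|, eps): number of closed intervals of length 2*eps needed."""
    return max(1, math.ceil(1.0 / (2.0 * eps)))


# Hypothetical inputs: E[C] = 10, sigma^2 = 1, dyadic scales eps = 2^-(j+1).
grid = [0.5 * 2.0 ** (-j) for j in range(20)]
bound = lipschitz_maximal_bound(10.0, 1.0, covering_number_unit_interval, grid)
print(bound)
```

At the largest scale $\varepsilon = 1/2$ a single point suffices and the bound is $\varepsilon\,\mathbf{E}[C] = 5$; at very small scales the logarithm of the covering number dominates. The minimum over the grid is attained at an intermediate scale.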


Remark  5.8.  The  idea  behind  Lemma  5.7  is  that  it  allows  us  to  trade  off 
between  exploiting  independence  (better  at  large  scales)  and  controlling  for 
dependence  (worse  at  large  scales).  However,  note  that  we  never  explicitly 
assume  or  use  independence  in  the  proof:  instead,  the  distance  d  could  be 
interpreted  as  a  proxy  for  the  degree  of  independence.  While  the  conclusion 
of Lemma 5.7 does not depend on the validity of this interpretation, we
expect  that  such  bounds  (and  the  more  powerful  bounds  to  be  developed  in 
the  sequel)  will  be  the  most  effective  when  the  distance  d  is  chosen  in  such  a 
way  that  large  distance  does  indeed  correspond  to  more  independence.  This 
is  often  the  case  in  practice.  In  the  case  of  Gaussian  processes,  for  example, 
we  will  see  in  the  next  chapter  that  this  idea  holds  to  such  a  degree  that  we 
can  obtain  matching  upper  and  lower  bounds  for  the  supremum  of  Gaussian 
processes  in  terms  of  the  geometry  of  the  index  set  (T,  d) ,  albeit  in  a  much 
more  sophisticated  manner  than  is  captured  by  the  trivial  Lemma  5.7. 

Remark 5.9. When $N(T,d,\varepsilon) = \infty$, the bound of Lemma 5.7 is infinite. However,
note that if $X_1, X_2, \ldots$ are i.i.d. unbounded random variables, then we
already have $\sup_i X_i = \infty$ a.s. It is therefore to be expected that the supremum
of a random process will typically indeed be infinite if it contains infinitely
many independent degrees of freedom. Thus the fact that $N(T,d,\varepsilon) = \infty$
(which means there are infinitely many points in $T$ that are well separated)
yields an infinite bound is not a shortcoming of Lemma 5.7. To obtain a finite
supremum for noncompact index sets $T$ one must often add a penalty inside
the supremum; such problems will be investigated in section 5.4 below.
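
The divergence of the supremum of i.i.d. unbounded variables can be observed numerically. The following sketch tracks the running maximum of i.i.d. standard Gaussian samples (one particular unbounded distribution, chosen here for illustration) and prints it next to the reference value $\sqrt{2\log n}$, which is the familiar order of magnitude of the maximum of $n$ such variables. The seed and the checkpoints are arbitrary.

```python
import math
import random

rng = random.Random(0)  # fixed seed, illustrative only
running_max = -math.inf
checkpoints = {10, 100, 1000, 10000, 100000}
records = []  # pairs (n, maximum of the first n samples)

for n in range(1, 100001):
    running_max = max(running_max, rng.gauss(0.0, 1.0))
    if n in checkpoints:
        records.append((n, running_max))

for n, value in records:
    print(n, round(value, 3), round(math.sqrt(2.0 * math.log(n)), 3))
```

The running maximum never decreases and keeps growing (slowly) with $n$, in line with the remark.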

In the remainder of this section, we will illustrate the application of Lemma
5.7 using two illuminating examples. Along the way, we will develop some
useful examples of how one can control covering numbers.

Example 5.10 (Random matrices). Let $M$ be an $n \times m$ random matrix such
that $M_{ij}$ are independent $\sigma^2$-subgaussian random variables. We would like to
estimate the magnitude of the operator norm

$$\|M\| := \sup_{v \in B_2^n,\, w \in B_2^m} \langle v, Mw\rangle = \sup_{(v,w) \in T} X_{v,w},$$

where $B_2^n = \{x \in \mathbb{R}^n : \|x\| \le 1\}$ is the Euclidean unit ball in $\mathbb{R}^n$ and

$$T := B_2^n \times B_2^m, \qquad X_{v,w} := \langle v, Mw\rangle = \sum_{i=1}^n \sum_{j=1}^m v_i M_{ij} w_j.$$

It follows immediately from Azuma’s inequality (Lemma 3.7) that $X_{v,w}$ is
$\sigma^2$-subgaussian for every $(v,w) \in T$. On the other hand, note that

$$\begin{aligned}
|X_{v,w} - X_{v',w'}| &= |\langle v, Mw\rangle - \langle v', Mw'\rangle| \\
&\le |\langle v - v', Mw\rangle| + |\langle v', M(w - w')\rangle| \\
&\le \|v - v'\|\,\|M\|\,\|w\| + \|v'\|\,\|M\|\,\|w - w'\| \\
&\le \|M\|\,\{\|v - v'\| + \|w - w'\|\}
\end{aligned}$$

for $(v,w), (v',w') \in T$. If we define a metric on $T$ as

$$d((v,w),(v',w')) := \|v - v'\| + \|w - w'\|,$$

we see that the random process is Lipschitz for the metric $d$.

Note  that  the  random  Lipschitz  constant  happens  to  be  ||M||,  which  is  in  fact 
the  quantity  we  are  trying  to  control  in  the  first  place!  This  is  a  rather  peculiar 
situation,  but  we  can  nonetheless  readily  apply  Lemma  5.7:  this  yields 

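$$\mathbf{E}[\|M\|] \le \varepsilon\,\mathbf{E}[\|M\|] + \sqrt{2\sigma^2 \log N(T,d,\varepsilon)}$$

for every $\varepsilon > 0$, which we can rearrange to obtain

$$\mathbf{E}[\|M\|] \le \inf_{0 < \varepsilon < 1} \frac{\sqrt{2\sigma^2 \log N(T,d,\varepsilon)}}{1 - \varepsilon}.$$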

What remains is to estimate the covering number. To this end, we must introduce
an additional idea that will be of significant importance in the sequel.

How can one construct a small $\varepsilon$-net $N$? The defining property of an $\varepsilon$-net
is that every point in $T$ is within a distance at most $\varepsilon$ of some point in $N$.
We can always achieve this by choosing a very dense set $N$. However, if we
want $|N|$ to be small, we should intuitively choose the points in $N$ to be as
far apart as possible. This motivates the following definition.

Definition 5.11 ($\varepsilon$-packing and packing number). A set $N \subseteq T$ is called
an $\varepsilon$-packing of $(T,d)$ if $d(t,t') > \varepsilon$ for every $t,t' \in N$, $t \ne t'$. The largest
cardinality of an $\varepsilon$-packing of $(T,d)$ is called the packing number

$$D(T,d,\varepsilon) := \sup\{|N| : N \text{ is an } \varepsilon\text{-packing of } (T,d)\}.$$

The key idea, which was already hinted at above, is that the notion of
packing is dual to the notion of covering, as is made precise by the following
result.  This  means  that  we  can  use  covering  and  packing  interchangeably  (up 
to  constants).  In  some  cases  it  is  easier  to  estimate  packing  numbers  than 
covering  numbers,  as  we  will  see  shortly.  On  the  other  hand,  we  will  see  in 
the  following  chapter  that  packing  numbers  arise  naturally  when  we  aim  to 
prove  lower  bounds  for  the  suprema  of  random  processes  (as  opposed  to  upper 
bounds  which  are  considered  exclusively  in  this  chapter). 

Lemma 5.12 (Duality between covering and packing). For every $\varepsilon > 0$,

$$D(T,d,2\varepsilon) \le N(T,d,\varepsilon) \le D(T,d,\varepsilon).$$

Note  that  this  can  indeed  be  viewed  as  a  form  of  duality  (in  the  sense  of 
optimization):  the  packing  number  is  defined  in  terms  of  a  supremum,  but  the 
covering  number  is  defined  in  terms  of  an  infimum. 

Proof. Let $D$ be a $2\varepsilon$-packing and let $N$ be an $\varepsilon$-net. For every $t \in D$, choose
$\pi(t) \in N$ such that $d(t,\pi(t)) \le \varepsilon$. Then for $t \ne t'$, we have

$$2\varepsilon < d(t,t') \le d(t,\pi(t)) + d(\pi(t),\pi(t')) + d(\pi(t'),t') \le 2\varepsilon + d(\pi(t),\pi(t')),$$

which implies $\pi(t) \ne \pi(t')$. Thus $\pi : D \to N$ is one-to-one, and therefore
$|D| \le |N|$. This yields the first inequality $D(T,d,2\varepsilon) \le N(T,d,\varepsilon)$.

To obtain the second inequality, let $D$ be a maximal $\varepsilon$-packing of $(T,d)$
(that is, $|D| = D(T,d,\varepsilon)$). We claim that $D$ is necessarily an $\varepsilon$-net. Indeed,
suppose this is not the case; then there is a point $t \in T$ such that $d(t,t') > \varepsilon$
for every $t' \in D$. But then $D \cup \{t\}$ must be an $\varepsilon$-packing also, which contradicts
the maximality of $D$. Thus we have $D(T,d,\varepsilon) = |D| \ge N(T,d,\varepsilon)$. □
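
The second half of the proof is constructive and is easy to try out on a finite set. The sketch below selects points greedily so that all selected points are more than $\varepsilon$ apart; once no further point can be added, the selection is a maximal $\varepsilon$-packing of the finite set and hence also an $\varepsilon$-net for it. The finite grid standing in for the unit disc, the value $\varepsilon = 0.3$ and the function names are assumptions made for this illustration.

```python
import math


def greedy_packing(points, eps):
    """Greedily select points whose pairwise distances are all > eps.

    The result is a maximal eps-packing of the finite set `points`: no
    further point of the set can be added without breaking the separation.
    """
    chosen = []
    for p in points:
        if all(math.dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen


def is_net(net, points, eps):
    """Check that every point lies within distance eps of some net point."""
    return all(any(math.dist(p, q) <= eps for q in net) for p in points)


# Finite stand-in for the unit disc: grid points of mesh 0.05 inside it.
grid = [(i / 20.0, j / 20.0) for i in range(-20, 21) for j in range(-20, 21)]
disc = [p for p in grid if math.hypot(*p) <= 1.0]

eps = 0.3
packing = greedy_packing(disc, eps)
print(len(packing), is_net(packing, disc, eps))
```

Any point that was rejected by the greedy rule is, by definition of the rule, within distance $\varepsilon$ of a point selected earlier, which is exactly the contradiction argument of the proof.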

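We are now in a position to bound the covering number of the Euclidean
ball $B_2^n$ with respect to the Euclidean distance. The proof of this elementary
result uses a clever technique known as a volume argument.

Lemma 5.13. We have $N(B_2^n, \|\cdot\|, \varepsilon) = 1$ for $\varepsilon \ge 1$ and

$$\Big(\frac{1}{\varepsilon}\Big)^n \le N(B_2^n, \|\cdot\|, \varepsilon) \le \Big(\frac{3}{\varepsilon}\Big)^n \quad\text{for } 0 < \varepsilon < 1.$$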

Proof. That $N(B_2^n, \|\cdot\|, \varepsilon) = 1$ for $\varepsilon \ge 1$ is obvious: by definition, we have
$\|t\| = \|t - 0\| \le 1$ for every $t \in B_2^n$, so the singleton $\{0\}$ is an $\varepsilon$-net.

The main part of the proof is illustrated in the following figure:

[Figure: left panel, a $2\varepsilon$-packing of $B_2^n$ with disjoint balls of radius $\varepsilon$ inside a ball of radius $1+\varepsilon$; right panel, an $\varepsilon$-net of $B_2^n$ whose balls of radius $\varepsilon$ cover $B_2^n$.]

The colored ball is $B_2^n$. To obtain an upper bound on the covering number, we
choose a $2\varepsilon$-packing $D$ of $B_2^n$ (black dots in left figure). Then the balls of radius
$\varepsilon$ around $t \in D$ are disjoint, and all these balls are contained in a large ball of
radius $1 + \varepsilon$. As the sum of the volumes of the small balls (of which there are
$|D|$) is bounded above by the volume of the large ball, we obtain an upper
bound on the size of $D$ (and thus on the covering number by Lemma 5.12).
To obtain a lower bound on the covering number, we choose an $\varepsilon$-net $N$ of
$B_2^n$ (black dots in right figure). As the balls of radius $\varepsilon$ around $t \in N$ cover
$B_2^n$, the sum of the volumes of these balls (of which there are $|N|$) is bounded
below by the volume of $B_2^n$. This yields a lower bound on the size of $N$.

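We now proceed to make this argument precise. Let us begin with the
upper bound. Let $D$ be a $2\varepsilon$-packing of $B_2^n$. As $\|t - t'\| > 2\varepsilon$ for all $t \ne t'$ in
$D$, the balls $\{B(t,\varepsilon) : t \in D\}$ must be disjoint. On the other hand, every ball
$B(t,\varepsilon)$ for $t \in B_2^n$ must be contained in the larger ball $B(0, 1+\varepsilon)$. Thus

$$\sum_{t \in D} \lambda(B(t,\varepsilon)) = \lambda\Big(\bigcup_{t \in D} B(t,\varepsilon)\Big) \le \lambda(B(0, 1+\varepsilon)),$$

where $\lambda$ denotes the Lebesgue measure on $\mathbb{R}^n$. By homogeneity of the Lebesgue
measure, $\lambda(B(t,\alpha)) = \lambda(B(0,\alpha)) = \lambda(\alpha B(0,1)) = \alpha^n \lambda(B(0,1))$. Thus

$$|D| \le \frac{\lambda(B(0, 1+\varepsilon))}{\lambda(B(0,\varepsilon))} = \Big(\frac{1+\varepsilon}{\varepsilon}\Big)^n.$$

As this holds for every $2\varepsilon$-packing $D$, we have evidently proved the upper
bound $N(B_2^n, \|\cdot\|, 2\varepsilon) \le D(B_2^n, \|\cdot\|, 2\varepsilon) \le (1 + 1/\varepsilon)^n \le (3/2\varepsilon)^n$ for $2\varepsilon < 1$.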

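To obtain the lower bound, let $N$ be an $\varepsilon$-net for $B_2^n$. Then

$$\lambda(B_2^n) \le \lambda\Big(\bigcup_{t \in N} B(t,\varepsilon)\Big) \le \sum_{t \in N} \lambda(B(t,\varepsilon)) = |N|\,\lambda(B(0,\varepsilon)),$$

so we obtain

$$|N| \ge \frac{\lambda(B_2^n)}{\lambda(B(0,\varepsilon))} = \Big(\frac{1}{\varepsilon}\Big)^n.$$

As this holds for every $\varepsilon$-net $N$, we have proved $N(B_2^n, \|\cdot\|, \varepsilon) \ge (1/\varepsilon)^n$.

□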
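
The last step of the upper bound in the proof above combines an elementary inequality with a change of scale. Spelled out (the symbol $\delta$ is introduced here only for this computation):

```latex
\[
  1 + \frac{1}{\varepsilon}
  \;\le\; \frac{1}{2\varepsilon} + \frac{1}{\varepsilon}
  \;=\; \frac{3}{2\varepsilon}
  \qquad\text{whenever } 2\varepsilon \le 1,
\]
\[
  \text{so that, with } \delta = 2\varepsilon, \qquad
  N(B_2^n, \|\cdot\|, \delta)
  \;\le\; \Big(\frac{3}{2\varepsilon}\Big)^n
  \;=\; \Big(\frac{3}{\delta}\Big)^n
  \qquad\text{for } 0 < \delta < 1 .
\]
```

This is the upper bound of Lemma 5.13 at scale $\delta$.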

Remark 5.14. Lemma 5.13 quantifies explicitly the dependence of the covering
number on dimension: the number of balls of radius $\varepsilon$ needed to cover a ball
in $\mathbb{R}^n$ is polynomial in $1/\varepsilon$ of order $n$. This is not surprising: think of how
many cubes of side length $\varepsilon$ can fit into the unit cube in $\mathbb{R}^n$. While balls do
not pack as nicely as cubes, the ultimate conclusion is the same (in fact, the
conclusion of Lemma 5.13 carries over to any norm on $\mathbb{R}^n$, see Problem 5.5).
In this manner, the dependence on dimension will enter explicitly into our
estimates of the suprema of random processes.

Beyond the concrete result on covering numbers in $\mathbb{R}^n$, Lemma 5.13 provides
a good way to think about the notion of dimension in the first place.
The classical idea that $\mathbb{R}^n$ is $n$-dimensional stems from its linear structure:
there is a basis of size $n$ such that any vector in $\mathbb{R}^n$ can be written as a linear
combination of these basis elements. This linear-algebraic notion of dimension
is not very useful in general spaces where one does not need to have any linear
structure. Lemma 5.13 motivates a different notion of dimension that makes
sense in any metric space: we say that a metric space $(T,d)$ has metric dimension
$n$ if $N(T,d,\varepsilon) \sim \varepsilon^{-n}$. Lemma 5.13 shows that for (bounded subsets of)
$\mathbb{R}^n$, the linear-algebraic and metric notions of dimension coincide; however,
the definition of metric dimension is independent of the linear structure of the
space. The notion of metric dimension certainly conforms to the intuitive notion
that a high-dimensional space has more “room” than a low-dimensional
space (the number of balls of fixed radius needed to cover the space increases
exponentially in the dimension). Of course, not every metric space has finite
metric dimension: we will shortly encounter an infinite-dimensional space
$(T,d)$ for which the covering numbers grow exponentially in $1/\varepsilon$.

Having developed some basic estimates, we can now complete the example
of random matrices. Here we are not interested in the covering number of $B_2^n$
itself, but rather in the covering number of $T = B_2^n \times B_2^m$ with respect to the
metric $d$. The latter is however easily estimated using Lemma 5.13. Let $N$ be
an $\varepsilon$-net for $B_2^n$ and let $N'$ be an $\varepsilon$-net for $B_2^m$. Then $N \times N'$ is a $2\varepsilon$-net for
$T$ of cardinality $|N||N'|$: indeed, setting $\pi((t,s)) = (\pi(t),\pi(s))$, we have

$$d((t,s),\pi((t,s))) = \|t - \pi(t)\| + \|s - \pi(s)\| \le 2\varepsilon.$$

This evidently implies that

$$N(T,d,2\varepsilon) \le |N||N'| \le \Big(\frac{3}{\varepsilon}\Big)^{n+m}, \quad\text{that is,}\quad N(T,d,\varepsilon) \le \Big(\frac{6}{\varepsilon}\Big)^{n+m}$$

for $\varepsilon < 1$. We therefore obtain

$$\mathbf{E}[\|M\|] \le \inf_{0 < \varepsilon < 1} \frac{\sqrt{2\sigma^2 \log N(T,d,\varepsilon)}}{1 - \varepsilon} \lesssim \sigma\sqrt{n+m}.$$

It turns out that this crude bound already captures the correct order of
magnitude of the matrix norm! In particular, for square matrices, we obtain
$\mathbf{E}[\|M\|] \lesssim \sqrt{n}$ as was already alluded to in Example 2.5.
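
The order of magnitude $\sigma\sqrt{n+m}$ can be compared with a simulation. The sketch below samples one matrix with i.i.d. $N(0,\sigma^2)$ entries (a particular $\sigma^2$-subgaussian case, chosen for illustration), estimates its operator norm by power iteration on $M^{\mathsf T}M$ (a standard numerical method that is not part of the argument above), and prints it next to an explicit version of the bound. The explicit constant is obtained by taking $\varepsilon = 1/2$ together with $N(T,d,1/2) \le 12^{n+m}$, which follows from Lemma 5.13 by taking products of $1/4$-nets. Note that the bound concerns $\mathbf{E}[\|M\|]$, whereas the simulation shows a single sample; sizes and seeds are arbitrary.

```python
import math
import random


def operator_norm(M, iterations=200, rng=None):
    """Estimate the largest singular value of M by power iteration on M^T M.

    The returned value is ||M w|| for a unit vector w, so it never exceeds
    the true operator norm.
    """
    rng = rng or random.Random(1)
    n, m = len(M), len(M[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(m)]
    for _ in range(iterations):
        norm_w = math.sqrt(sum(x * x for x in w))
        w = [x / norm_w for x in w]
        u = [sum(M[i][j] * w[j] for j in range(m)) for i in range(n)]  # u = M w
        w = [sum(M[i][j] * u[i] for i in range(n)) for j in range(m)]  # w = M^T u
    norm_w = math.sqrt(sum(x * x for x in w))
    w = [x / norm_w for x in w]
    u = [sum(M[i][j] * w[j] for j in range(m)) for i in range(n)]
    return math.sqrt(sum(x * x for x in u))


rng = random.Random(0)  # fixed seed, illustrative only
n, m, sigma = 20, 20, 1.0
M = [[rng.gauss(0.0, sigma) for _ in range(m)] for _ in range(n)]

estimate = operator_norm(M)
# Explicit bound with eps = 1/2: (1 - eps)^-1 * sqrt(2 sigma^2 (n+m) log 12).
bound = 2.0 * sigma * math.sqrt(2.0 * (n + m) * math.log(12.0))
print(estimate, sigma * math.sqrt(n + m), bound)
```

The explicit constant is far from optimal, but the simulated norm and the bound are both proportional to $\sigma\sqrt{n+m}$ up to constants.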

We now turn to our second example. Unlike in the previous example, where
we  got  a  sharp  result  with  little  work,  we  will  not  be  so  lucky  here:  we  will 
derive  a  nontrivial  bound  from  Lemma  5.7,  but  the  methods  we  developed  so 
far  will  prove  to  be  too  crude  to  capture  the  correct  order  of  magnitude. 

Example 5.15 (Wasserstein law of large numbers). Let $X_1, X_2, \ldots$ be i.i.d.
random variables with values in the interval $[0,1]$. We denote their distribution
as $X_i \sim \mu$. Define the empirical measure of $X_1, \ldots, X_n$ as

$$\mu_n := \frac{1}{n}\sum_{k=1}^n \delta_{X_k}.$$

Then it is easy to estimate

$$\mathbf{E}|\mu_n f - \mu f| \le \mathbf{E}[|\mu_n f - \mu f|^2]^{1/2} \le \frac{\|f\|_\infty}{\sqrt{n}}.$$

In particular, we have $\mu_n f \to \mu f$ in $L^1$ for every bounded function $f$: this is
none other than the weak law of large numbers with the optimal $n^{-1/2}$ rate.

At what rate does the law of large numbers $\mu_n \to \mu$ hold when we consider
other notions of distance between probability measures? In this spirit, we will
presently attempt to estimate the expected Wasserstein distance $\mathbf{E}[W_1(\mu_n,\mu)]$
between the empirical measure and the underlying distribution. Recall that

$$W_1(\mu_n,\mu) = \sup_{f \in \mathrm{Lip}([0,1])} \{\mu_n f - \mu f\} = \sup_{f \in \mathcal{F}} X_f,$$

where we have defined

$$X_f := \mu_n f - \mu f, \qquad \mathcal{F} := \{f \in \mathrm{Lip}([0,1]) : 0 \le f \le 1\}.$$

Thus this question reduces to controlling the supremum of a random process.
(Note that $|f(x) - f(y)| \le |x - y| \le 1$ for $f \in \mathrm{Lip}([0,1])$ and $x,y \in [0,1]$; as
$X_f$ is invariant under adding a constant to $f$, there is no loss of generality in
restricting the supremum to functions $0 \le f \le 1$ in the definition of $W_1$.)

We begin by noting the trivial estimate

$$|X_f - X_g| = |\mu_n(f-g) - \mu(f-g)| \le 2\|f - g\|_\infty.$$

Thus the process $\{X_f\}_{f \in \mathcal{F}}$ is Lipschitz with respect to the uniform distance
on $\mathcal{F}$. On the other hand, note that by definition

$$X_f = \sum_{k=1}^n \frac{f(X_k) - \mu f}{n},$$

which is a sum of i.i.d. random variables with values in the interval $[-\frac{1}{n}, \frac{1}{n}]$.
Thus $X_f$ is $\frac{1}{n}$-subgaussian for every $f \in \mathcal{F}$ by the Azuma-Hoeffding inequality
(Lemma 3.6). We can therefore estimate using Lemma 5.7

$$\mathbf{E}[W_1(\mu_n,\mu)] \le \inf_{\varepsilon > 0}\Big\{2\varepsilon + \sqrt{\frac{2}{n}\log N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)}\Big\}.$$

To proceed, we must bound the covering number $N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$.

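Lemma 5.16. There is a constant $c < \infty$ such that

$$N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le e^{c/\varepsilon} \ \text{ for } \varepsilon < \tfrac{1}{2}, \qquad N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) = 1 \ \text{ for } \varepsilon \ge \tfrac{1}{2}.$$

Remark 5.17. Note that, unlike in the case of a Euclidean ball where the
covering number is polynomial in $1/\varepsilon$, the covering number of the family $\mathcal{F}$ of
Lipschitz functions is exponential in $1/\varepsilon$. This indicates that the metric space
$(\mathcal{F}, \|\cdot\|_\infty)$ is in fact infinite-dimensional, which is not too surprising.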

Proof. Fix $\varepsilon > 0$. For every function $f \in \mathcal{F}$, we will construct a new function
$\pi(f)$ in the manner illustrated in the following picture:

[Figure: a Lipschitz function $f$ and its piecewise constant approximation $\pi(f)$ on a grid with horizontal mesh $\varepsilon/2$ and vertical mesh $\varepsilon$.]

To be precise, we approximate $f : [0,1] \to [0,1]$ by $\pi(f) : [0,1] \to [0,1]$
defined as follows. Partition the horizontal axis into consecutive nonoverlapping
intervals $I_1, \ldots, I_{\lceil 2/\varepsilon \rceil}$ of size $\varepsilon/2$ and the vertical axis into consecutive
nonoverlapping intervals $J_1, \ldots, J_{\lceil 1/\varepsilon \rceil}$ of size $\varepsilon$. We then define

$$\pi(f)(x) = \frac{\max J_\ell + \min J_\ell}{2} \quad\text{whenever } x \in I_k,\ f(\min I_k) \in J_\ell.$$

That is, in each interval on the horizontal axis, we approximate $f$ by its value
at the left endpoint of the interval rounded to the center of the interval on the
vertical axis to which it belongs. By construction, the set $N = \{\pi(f) : f \in \mathcal{F}\}$
is an $\varepsilon$-net: indeed, note that whenever $x \in I_k$ and $f(\min I_k) \in J_\ell$, we have

$$\begin{aligned}
|f(x) - \pi(f)(x)| &\le |f(x) - f(\min I_k)| + \Big|f(\min I_k) - \frac{\max J_\ell + \min J_\ell}{2}\Big| \\
&\le |x - \min I_k| + \frac{\max J_\ell - \min J_\ell}{2} \le \varepsilon,
\end{aligned}$$

where we have used the Lipschitz property of $f$ and the definition of $I_k$, $J_\ell$.
(Note that $N \not\subseteq \mathcal{F}$, but this is not a problem, cf. Remark 5.6.)

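As we now have an $\varepsilon$-net $N$, it remains to estimate $|N|$. The most naive
bound would be $|N| \le \lceil 1/\varepsilon \rceil^{\lceil 2/\varepsilon \rceil} < \infty$, but we can do somewhat better by
taking into account the Lipschitz property of the functions in $\mathcal{F}$. Note that

$$|\pi(f)(\min I_k) - \pi(f)(\min I_{k+1})| \le |f(\min I_k) - f(\min I_{k+1})| + \varepsilon \le \tfrac{3}{2}\varepsilon.$$

As the possible values of $\pi(f)$ can only differ by multiples of $\varepsilon$, this implies
that $\pi(f)(\min I_{k+1}) - \pi(f)(\min I_k) \in \{-\varepsilon, 0, \varepsilon\}$. Thus $\pi(f)(0)$ can take any
of $\lceil 1/\varepsilon \rceil$ different values, but each subsequent interval can only differ from the
previous one in three different ways. This implies the bound

$$N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le |N| \le \lceil 1/\varepsilon \rceil\, 3^{\lceil 2/\varepsilon \rceil - 1} \le e^{c/\varepsilon}$$

for some constant $c$ and every $\varepsilon > 0$. On the other hand, as $\|f - \tfrac{1}{2}\|_\infty \le \tfrac{1}{2}$ for
every $f \in \mathcal{F}$, we clearly have $N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) = 1$ for $\varepsilon \ge \tfrac{1}{2}$. □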
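
The counting step can be checked by brute force. The sketch below counts, by dynamic programming, all sequences of $\lceil 2/\varepsilon \rceil$ levels in $\{0, \ldots, \lceil 1/\varepsilon \rceil - 1\}$ whose consecutive entries differ by at most one; by the argument above, every element of the net corresponds to such a sequence, so this count is an upper bound on $|N|$. It is then compared with the naive bound and with the refined bound $\lceil 1/\varepsilon \rceil\, 3^{\lceil 2/\varepsilon \rceil - 1}$. The function names and the tested values of $\varepsilon$ are illustrative assumptions.

```python
import math


def count_net_functions(eps):
    """Count sequences (a_1, ..., a_K) of levels in {0, ..., L-1} with
    |a_{k+1} - a_k| <= 1, where L = ceil(1/eps) and K = ceil(2/eps)."""
    n_levels = math.ceil(1.0 / eps)
    n_intervals = math.ceil(2.0 / eps)
    counts = [1] * n_levels  # sequences of length 1 ending at each level
    for _ in range(n_intervals - 1):
        counts = [
            sum(counts[j] for j in (i - 1, i, i + 1) if 0 <= j < n_levels)
            for i in range(n_levels)
        ]
    return sum(counts)


def naive_bound(eps):
    """ceil(1/eps)^ceil(2/eps): every interval takes any level."""
    return math.ceil(1.0 / eps) ** math.ceil(2.0 / eps)


def refined_bound(eps):
    """ceil(1/eps) * 3^(ceil(2/eps) - 1): three choices per subsequent step."""
    return math.ceil(1.0 / eps) * 3 ** (math.ceil(2.0 / eps) - 1)


for eps in (0.25, 0.125, 0.0625):
    print(eps, count_net_functions(eps), refined_bound(eps), naive_bound(eps))
```

The refined bound grows like $e^{c/\varepsilon}$, whereas the naive bound grows like $e^{(c/\varepsilon)\log(1/\varepsilon)}$; the gap between the two widens rapidly as $\varepsilon$ decreases.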

Having estimated the covering numbers of $\mathcal{F}$, we can now readily complete
our bound on the convergence rate in the Wasserstein law of large numbers:

$$\mathbf{E}[W_1(\mu_n,\mu)] \le \inf_{\varepsilon > 0}\Big\{2\varepsilon + \sqrt{\frac{2c}{n\varepsilon}}\Big\} \lesssim n^{-1/3}.$$

Recall that the rate of convergence in the law of large numbers for a single
function is $\mathbf{E}|\mu_n f - \mu f| \lesssim n^{-1/2}$, but we have obtained a slower rate $n^{-1/3}$
when we consider the convergence uniformly over Lipschitz functions. Is this
rate sharp? It turns out that this is not the case: in the present example, we
will show in the next section that the optimal rate is actually still $\sim n^{-1/2}$.
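
The origin of the exponent $1/3$ is an elementary balancing of the two terms inside the infimum. As a worked computation, using $\log N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le c/\varepsilon$ with $c$ the constant from the proof of Lemma 5.16:

```latex
\[
  \varepsilon \;\asymp\; \frac{1}{\sqrt{n\varepsilon}}
  \quad\Longleftrightarrow\quad
  \varepsilon^{3/2} \;\asymp\; n^{-1/2}
  \quad\Longleftrightarrow\quad
  \varepsilon \;\asymp\; n^{-1/3},
\]
\[
  \text{and the choice } \varepsilon = n^{-1/3} \text{ gives}\qquad
  2\varepsilon + \sqrt{\frac{2c}{n\varepsilon}}
  \;=\; 2n^{-1/3} + \sqrt{2c}\,n^{-1/3}
  \;=\; \big(2 + \sqrt{2c}\big)\,n^{-1/3}.
\]
```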

Remark 5.18. There is no reason to expect, in general, that the rate of convergence
uniformly over a class of functions will be the same as that for a single
function. The fact that the rate still turns out to be $n^{-1/2}$ in the present
setting is an artefact of the fact that we are working in one dimension: for
random variables $X_k \in [0,1]^p$ for $p \ge 2$, the optimal rates turn out to be
strictly slower than $n^{-1/2}$. Nonetheless, even in this case, the method we have
used in this section does not capture the correct rate of convergence.

The method that we have used in this section to control the suprema of
random processes is too crude to obtain sharp results in most examples of
interest.  While  we  obtained  a  sharp  result  in  the  random  matrix  example, 
this  was  not  the  case  for  the  Wasserstein  law  of  large  numbers.  Unfortunately, 
the  situation  encountered  in  the  second  example  is  the  norm.  It  is  illuminating 
to  understand  in  what  part  of  the  proof  we  incurred  the  loss  of  precision:  this 
will  directly  motivate  the  more  powerful  approach  for  bounding  the  suprema 
of  random  processes  that  will  be  developed  in  the  next  section. 

The approach of Lemma 5.7 relies on two steps: the approximation of the
supremum by a finite maximum, and the estimation of the finite maximum
using a suitable maximal inequality. The key problem with this approach is
that we have approximated the supremum by a maximum in an extremely
inefficient manner by using an almost sure Lipschitz property of the process.
Let us illustrate this in the second example. Here the Lipschitz property reads

$$|X_f - X_g| \le 2\|f - g\|_\infty \quad\text{a.s.}$$

One cannot substantially improve on this bound if the result is required to
hold almost surely. On the other hand, we can easily compute

$$\mathbf{E}|X_f - X_g| \le n^{-1/2}\|f - g\|_\infty.$$

While the almost sure Lipschitz constant of the process $X_f$ is 2, we see that
$X_f$ is Lipschitz on average with Lipschitz constant $n^{-1/2} \ll 2$: that is, the
typical behavior of the increments $|X_f - X_g|$ is much better than their worst-case
behavior! One can therefore readily understand why using the almost
sure Lipschitz property incurs a significant loss in our estimates. If we were
to naively substitute the “typical” Lipschitz constant $n^{-1/2}$ rather than the
“worst-case” constant 2 in the above computation, we would indeed obtain
the correct $n^{-1/2}$ rather than $n^{-1/3}$ rate. However, the almost sure Lipschitz
property was crucial in order to control the approximation error in Lemma
5.7, so that such a substitution is certainly unjustified at this point.
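
The gap between the typical and the worst-case size of the increments is easy to observe in a simulation. The sketch below takes $\mu$ to be the uniform distribution on $[0,1]$ and the two functions $f(x) = x$ and $g(x) = x/2$, both of which belong to $\mathcal{F}$, and estimates $\mathbf{E}|X_f - X_g|$ by Monte Carlo. The choice of $\mu$, $f$, $g$, the sample size, the number of trials and the seed are all assumptions made for illustration.

```python
import math
import random


def increment(sample, f, g, mean_f, mean_g):
    """X_f - X_g = mu_n(f - g) - mu(f - g) for a given sample X_1, ..., X_n."""
    n = len(sample)
    return sum(f(x) - g(x) for x in sample) / n - (mean_f - mean_g)


def f(x):
    return x          # 1-Lipschitz, values in [0, 1]


def g(x):
    return x / 2.0    # 1-Lipschitz, values in [0, 1]


rng = random.Random(0)  # fixed seed, illustrative only
n, trials = 400, 200
sup_norm = 0.5                 # ||f - g||_inf on [0, 1]
mean_f, mean_g = 0.5, 0.25     # integrals of f and g against the uniform law

increments = [
    abs(increment([rng.random() for _ in range(n)], f, g, mean_f, mean_g))
    for _ in range(trials)
]
average = sum(increments) / trials
print("average |X_f - X_g| :", average)
print("n^(-1/2) ||f - g||  :", sup_norm / math.sqrt(n))
print("almost sure bound   :", 2.0 * sup_norm)
```

The simulated average sits below $n^{-1/2}\|f - g\|_\infty$, which in turn is much smaller than the almost sure bound $2\|f - g\|_\infty$ once $n$ is large.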

Remark  5.19.  We  can  now  also  understand  why  the  crude  approach  of  Lemma 
5.7  proves  to  be  useful  in  the  random  matrix  example:  in  this  setting,  it 
so  happens  that  the  almost  sure  Lipschitz  constant  is  of  the  same  order  as 
the  supremum  that  we  are  trying  to  compute.  Therefore,  even  though  our 
approximation  is  inefficient,  this  does  not  affect  the  final  bound  except  in  the 
numerical  constant.  However,  this  situation  is  essentially  a  coincidence.  In  the 
Wasserstein  law  of  large  numbers  example,  the  almost  sure  Lipschitz  constant 
is  much  larger  than  the  supremum  of  interest,  so  that  the  inefficiency  in  our 
approximation  swamps  the  final  bound  that  we  obtain. 

The basic challenge we therefore face at this point in improving the approach
of Lemma 5.7 is to devise a method of approximation that only uses an
“in  probability”  version  of  the  Lipschitz  property  that  can  capture  the  typical 
size  of  the  increments,  rather  than  an  a.s.  Lipschitz  property  that  captures  the 
worst  case.  In  the  next  section,  we  will  see  that  this  goal  can  be  accomplished 
by  using  a  powerful  technique  known  as  chaining. 

Problems 

5.4 (Tightness of Johnson-Lindenstrauss). The Johnson-Lindenstrauss
lemma proved in Problem 5.3 shows that any $n$ points in a Hilbert space $H$
can be mapped into $\mathbb{R}^k$ with $k \gtrsim \log n$ while distorting the distances between
them by at most a constant factor. Show that $k \gtrsim \log n$ is in fact necessary.
Hint: show that the image of $n$ orthonormal vectors $x_1, \ldots, x_n$ in $H$ under a
map $T : H \to \mathbb{R}^k$ that nearly preserves distances is a packing of a ball in $\mathbb{R}^k$.

5.5 (Covering norm-balls in $\mathbb{R}^n$). The goal of this problem is to investigate
Lemma 5.13 for norms other than the Euclidean norm.

a. Show that the conclusion of Lemma 5.13 holds in any finite-dimensional
Banach space: that is, if $|\cdot|$ is any norm on $\mathbb{R}^n$, then we have

$$\Big(\frac{1}{\varepsilon}\Big)^n \le N(B, |\cdot|, \varepsilon) \le \Big(\frac{3}{\varepsilon}\Big)^n \quad\text{for } 0 < \varepsilon < 1,$$

where $B$ denotes the unit norm-ball $\{x \in \mathbb{R}^n : |x| \le 1\}$.

b. Show that in the special case $n = 1$, we can compute exactly

$$N(B, |\cdot|, \varepsilon) = \Big\lceil \frac{1}{\varepsilon} \Big\rceil.$$

5.6 (Proper covering numbers). In our definition of an $\varepsilon$-net $N$ for $(T,d)$,
we did not assume that $N \subseteq T$ (cf. Remark 5.6). It can happen quite naturally
that the points that we use to approximate the set $T$ are not themselves in
$T$, for example, see the proof of Lemma 5.16. On the other hand, in some
applications, it may be convenient to require that $N \subseteq T$. When this is the
case, the $\varepsilon$-net is said to be proper, and the proper covering number $N^{\mathrm{pr}}(T,d,\varepsilon)$
denotes the cardinality of the smallest proper $\varepsilon$-net. Show that

$$N(T,d,\varepsilon) \le N^{\mathrm{pr}}(T,d,\varepsilon) \le N(T,d,\varepsilon/2),$$

which  implies  that  the  assumption  of  properness  is  harmless  in  most  cases. 

5.7 (Parametric classes). Consider a function $f : \Theta \times X \to \mathbb{R}$ such that

$$|f_\theta(x) - f_{\theta'}(x)| \le C\,d(\theta,\theta') \quad\text{for all } x \in X$$

for some metric $d$ on $\Theta$. We think of $x \mapsto f_\theta(x)$ as a function on $X$ that is
parametrized by a parameter $\theta \in \Theta$. Thus it makes sense to define

$$\mathcal{F} = \{f_\theta : \theta \in \Theta\}.$$

Show that

$$N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le N(\Theta, d, \varepsilon/C).$$

Thus the covering numbers of parametrized classes of functions that are Lipschitz
in the parameter can be controlled by the covering numbers of the
parameter space. This is often useful, for example, in parametric statistics.

5.8 (Wasserstein LLN in higher dimension). The goal of this problem is
to extend Example 5.15 to the multidimensional situation where $X_1, X_2, \ldots$
are i.i.d. random variables with values in the cube $[0,1]^d$.

a. Let $\mathcal{F}_0 := \{f \in \mathrm{Lip}([0,1]^d) : f(0) = 0\}$. Show that

$$N(\mathcal{F}_0, \|\cdot\|_\infty, \varepsilon) \le e^{c/\varepsilon^d},$$

where the constant $c$ depends on dimension $d$ only.

b. What upper bound on the rate in the Wasserstein law of large numbers in
dimension $d$ does this imply using the crude method of Lemma 5.7?


5.3 The chaining method

In the previous section, we developed a simple method to bound the supremum
of a random process that satisfies the Lipschitz property $X_t - X_s \lesssim d(t,s)$
in an almost sure sense. However, we have seen that this requirement is very
restrictive: in many cases, the typical size of the increments $X_t - X_s$ is much
smaller than in the worst case. We therefore aim to develop a method to bound
the suprema of random processes that only requires the Lipschitz property
$X_t - X_s \lesssim d(t,s)$ to hold in probability in a suitable sense.

To  understand  how  one  might  approach  this  problem,  let  us  recall  the 
basic idea behind the proof of Lemma 5.7. If $N$ is an $\varepsilon$-net, we can estimate

$$\mathbf{E}\Big[\sup_{t \in T} X_t\Big] \le \mathbf{E}\Big[\sup_{t \in T} X_{\pi(t)}\Big] + \mathbf{E}\Big[\sup_{t \in T}\{X_t - X_{\pi(t)}\}\Big].$$

The first term is a finite maximum that can be controlled by the maximal
inequality of Lemma 5.1. The second term is a small remainder: each variable
inside the supremum has magnitude of order $\varepsilon$ by the Lipschitz property of the
process. If the Lipschitz property holds in an almost sure sense, the supremum
drops out and we can immediately control the remainder term.

However, if the Lipschitz property only holds in probability, we cannot
directly control the remainder term. Indeed, in this case each variable inside
the supremum has “typical” size $\varepsilon$; however, we have to control the supremum
of many such variables, whose magnitude can be much larger than $\varepsilon$ (e.g., the
maximum of $n$ independent $N(0,\sigma^2)$ variables is of order $\sigma\sqrt{\log n} \gg \sigma$, even
though each variable is only of order $\sigma$). Therefore, in this case, the problem
of controlling the remainder term is essentially of the same type as that of
controlling the original supremum of interest. Nonetheless, we expect that the
remainder term is smaller than the original supremum, as the size of each
variable in the remainder term is now smaller. To shrink the remainder term
further, we can approximate it once again by a finite maximum at a smaller
scale. For example, if $N'$ is an $\varepsilon/2$-net with associated map $\pi'$, then we can estimate


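$$\mathbf{E}\Big[\sup_{t\in T}\{X_t - X_{\pi(t)}\}\Big] \le \mathbf{E}\Big[\sup_{t\in T}\{X_{\pi'(t)} - X_{\pi(t)}\}\Big] + \mathbf{E}\Big[\sup_{t\in T}\{X_t - X_{\pi'(t)}\}\Big].$$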

The first term on the right is a finite maximum that can be controlled by
Lemma 5.1. The remainder term is still an infinite supremum, but now each
variable inside the supremum is only of order $\varepsilon/2$: that is, we have cut the
remainder term roughly by half. The key idea of this section is that we can
repeat this procedure over and over again, each time cutting the size of the
remainder term roughly by half. Let us investigate this idea a bit more
systematically. For each $k \ge 0$, let $N_k$ be a $2^{-k}$-net and choose $\pi_k(t) \in N_k$ such
that $d(t,\pi_k(t)) \le 2^{-k}$. Repeating the approximation $n$ times, we obtain

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \mathbf{E}\Big[\sup_{t\in T} X_{\pi_0(t)}\Big] + \sum_{k=1}^n \mathbf{E}\Big[\sup_{t\in T}\underbrace{\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\}}_{\lesssim\, 2^{-k}}\Big] + \mathbf{E}\Big[\sup_{t\in T}\underbrace{\{X_t - X_{\pi_n(t)}\}}_{\lesssim\, 2^{-n}}\Big].$$


The remainder term is now a supremum of variables of order $2^{-n}$. Under mild
conditions, the remainder term will disappear if we let $n \to \infty$ without having
to invoke any almost sure Lipschitz property of the process. Thus we surmount
the inefficiency of Lemma 5.7 by approximating the supremum not at a single
scale, but at infinitely many scales. The remaining bound is now an infinite
sum: the $k$th term in the sum is a finite maximum of random variables at the
scale $2^{-k}$. To control these finite maxima, we also do not require an almost
sure Lipschitz property: in view of Lemma 5.1, it suffices to assume that the
Lipschitz property holds “in probability” in the following sense.

Definition 5.20 (Subgaussian process). A random process $\{X_t\}_{t\in T}$ on
the metric space $(T,d)$ is called subgaussian if $\mathbf{E}[X_t] = 0$ and

$$\mathbf{E}[e^{\lambda\{X_t - X_s\}}] \le e^{\lambda^2 d(t,s)^2/2} \quad\text{for all } t,s\in T,\ \lambda \ge 0.$$

Remark 5.21. The subgaussian property should indeed be interpreted as an
“in probability” form of the Lipschitz property: by Problem 3.1, the subgaussian
assumption is equivalent up to constants to an assumption of the form

$$\mathbf{P}[|X_t - X_s| \ge x\,d(t,s)] \le C e^{-x^2/C}.$$

Note also that the assumption $\mathbf{E}[e^{\lambda\{X_t - X_s\}}] \le e^{\lambda^2 d(t,s)^2/2}$ already implies
$\mathbf{E}[X_t - X_s] = 0$ (as $\lim_{\lambda\downarrow 0}\{e^{c\lambda^2/2} - 1\}/\lambda = 0$), so the assumption $\mathbf{E}[X_t] = 0$
merely imposes a convenient normalization. In Section 5.4, we will see how to
control the suprema of random processes with nontrivial mean $t \mapsto \mathbf{E}[X_t]$.
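
One direction of the equivalence mentioned in Remark 5.21 is a routine Chernoff argument. The following sketch is included only for orientation: it assumes $d(t,s) > 0$, and the constants it produces are those of this particular computation, not necessarily those of Problem 3.1.

```latex
% Chernoff bound: for any \lambda \ge 0 and x \ge 0, Markov's inequality and
% the subgaussian assumption give
\begin{align*}
\mathbf{P}[X_t - X_s \ge x\,d(t,s)]
  &\le e^{-\lambda x d(t,s)}\,\mathbf{E}[e^{\lambda\{X_t - X_s\}}]
   \le e^{-\lambda x d(t,s) + \lambda^2 d(t,s)^2/2}.
\end{align*}
% Choosing \lambda = x/d(t,s) minimizes the exponent:
\begin{align*}
\mathbf{P}[X_t - X_s \ge x\,d(t,s)] \le e^{-x^2/2},
\qquad\text{hence}\qquad
\mathbf{P}[|X_t - X_s| \ge x\,d(t,s)] \le 2e^{-x^2/2},
\end{align*}
% where the two-sided bound applies the one-sided bound to both (t,s) and (s,t).
```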

The technique that we have outlined above is known as chaining: the idea
is to approximate $X_t$ by a “chain” $X_{\pi_k(t)}$ of increasingly accurate approximations
(the “links” in the chain are the increments $X_{\pi_k(t)} - X_{\pi_{k-1}(t)}$). The main
remaining difficulty in implementing the method is to show that the remainder
term does indeed vanish as $n \to \infty$. To get around this, we will impose a
very mild technical assumption that holds in almost all cases of interest.

Definition 5.22 (Separable process). A random process $\{X_t\}_{t\in T}$ is called
separable if there is a countable set $T_0 \subseteq T$ such that

$$X_t \in \lim_{\substack{s\to t \\ s\in T_0}} X_s \quad\text{for all } t\in T \text{ a.s.}$$

[Here $x \in \lim_{s\to t} x_s$ means that there is a sequence $s_n \to t$ such that $x_{s_n} \to x$.]

Remark 5.23. The assumption of separability is technical, and is almost always
trivially satisfied. For example, if $t \mapsto X_t$ is continuous a.s., we can take $T_0$
to be any countable dense subset of $T$. At the same time, the separability
assumption is in some sense intrinsic to the chaining argument. After all, the
main idea of the chaining argument is to approximate $X_t = \lim_{k\to\infty} X_{\pi_k(t)}$ for
every $t \in T$. If this is in fact valid, however, then the definition of a separable
process will hold for the countable set $T_0 = \{\pi_k(t) : k \ge 0,\ t \in T\}$.

For completeness, let us note a somewhat esoteric point that we swept
under the rug. If $T$ is uncountable, $\sup_{t\in T} X_t$ is the supremum of an uncountable
family of random variables. In general, the supremum of uncountably
many measurable functions is not even necessarily measurable. Measurability
issues do arise, on occasion, in the control of suprema, but we will shamelessly
ignore such problems in these notes. Under the separability assumption, however,
$\sup_{t\in T} X_t = \sup_{t\in T_0} X_t$ a.s., and thus no measurability problems arise
(as a countable supremum of measurable functions is always measurable).

We  now  have  all  the  ingredients  to  implement  the  chaining  argument. 

Theorem 5.24 (Dudley). Let $\{X_t\}_{t\in T}$ be a separable subgaussian process
on the metric space $(T,d)$. Then we have the following estimate:

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le 6\sum_{k\in\mathbb{Z}} 2^{-k}\sqrt{\log N(T,d,2^{-k})}.$$

Proof. We first prove the result in the finite case $|T| < \infty$, which allows us
to easily eliminate the remainder term in the chaining argument. We subsequently
use the separability assumption to lift this restriction.

Let $|T| < \infty$. Let $k_0$ be the largest integer such that $2^{-k_0} \ge \mathrm{diam}(T)$. Then
any singleton $N_{k_0} = \{t_0\}$ is trivially a $2^{-k_0}$-net. We therefore start chaining at
the scale $2^{-k_0}$. For $k > k_0$, let $N_k$ be a $2^{-k}$-net such that $|N_k| = N(T,d,2^{-k})$.
Running the chaining argument up to the scale $2^{-n}$ yields


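$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \mathbf{E}[X_{t_0}] + \sum_{k=k_0+1}^{n} \mathbf{E}\Big[\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\}\Big] + \mathbf{E}\Big[\sup_{t\in T}\{X_t - X_{\pi_n(t)}\}\Big].$$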


Let us consider each of the terms. As $\mathbf{E}[X_{t_0}] = 0$ by assumption, the first term
disappears. Moreover, as $|T| < \infty$, we can choose $n$ sufficiently large so that
$N_n = T$ (and hence $\pi_n(t) = t$). Then the last term disappears. To control the
terms inside the sum, note that the maximum in the $k$th term contains at most
$|N_k||N_{k-1}| \le |N_k|^2$ terms (as $|N_{k-1}| \le |N_k|$). Moreover, we can readily estimate

$$d(\pi_k(t),\pi_{k-1}(t)) \le d(t,\pi_k(t)) + d(t,\pi_{k-1}(t)) \le 3\times 2^{-k}.$$

As $X_{\pi_k(t)} - X_{\pi_{k-1}(t)}$ is $d(\pi_k(t),\pi_{k-1}(t))^2$-subgaussian, Lemma 5.1 yields

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le 6\sum_{k>k_0} 2^{-k}\sqrt{\log |N_k|}.$$

But $|N_k| = N(T,d,2^{-k})$ by construction, so the proof is complete.

In the proof we have used the assumption $|T| < \infty$ to control the remainder
term in the chaining argument. We now use separability to show that one can
approximate the general case by the finite case. Indeed, by separability, there
is a countable subset $T' \subseteq T$ such that $\sup_{t\in T} X_t = \sup_{t\in T'} X_t$ a.s. Denote
by $T_k$ the first $k$ elements of $T'$ (in arbitrary order). Then

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] = \mathbf{E}\Big[\sup_{t\in T'} X_t\Big] = \sup_{k\ge 1}\mathbf{E}\Big[\sup_{t\in T_k} X_t\Big]$$

by monotone convergence. Applying the chaining inequality to each finite
maximum and using $N(T_k,d,\varepsilon) \le N(T,d,\varepsilon)$ yields the general result. □
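
For readers tracking constants, the factor 6 in the finite case of the proof above can be recovered as follows. This sketch assumes that Lemma 5.1 is applied in its standard subgaussian form, $\mathbf{E}[\max_{i\le m} Y_i] \le \sigma\sqrt{2\log m}$ for $\sigma^2$-subgaussian variables $Y_1,\ldots,Y_m$; the symbols $Y_i$, $m$ and $\sigma$ are introduced only for this illustration.

```latex
% k-th link: at most m = |N_k|^2 variables, each \sigma^2-subgaussian
% with \sigma = 3 \times 2^{-k} (the bound on d(\pi_k(t), \pi_{k-1}(t))).
\begin{align*}
\mathbf{E}\Big[\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\}\Big]
  &\le 3\times 2^{-k}\sqrt{2\log |N_k|^2}
   = 3\times 2^{-k}\sqrt{4\log |N_k|}
   = 6\times 2^{-k}\sqrt{\log |N_k|}.
\end{align*}
% Summing over k_0 < k \le n gives the bound stated in the proof.
```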


Very often the result of Theorem 5.24 is written in a slightly different
form by noting that the sum can be viewed as a Riemann sum approximation
to a certain integral. There is no particular mathematical significance to this
reformulation: it is made for purely aesthetic reasons.


Corollary 5.25 (Entropy integral). Let $\{X_t\}_{t\in T}$ be a separable subgaussian
process on the metric space $(T,d)$. Then we have the following estimate:

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le 12\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon.$$


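Proof. We can readily estimate

$$\begin{aligned}
\sum_{k\in\mathbb{Z}} 2^{-k}\sqrt{\log N(T,d,2^{-k})}
  &= 2\sum_{k\in\mathbb{Z}} \int_{2^{-k-1}}^{2^{-k}} \sqrt{\log N(T,d,2^{-k})}\,d\varepsilon \\
  &\le 2\sum_{k\in\mathbb{Z}} \int_{2^{-k-1}}^{2^{-k}} \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon \\
  &= 2\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon,
\end{aligned}$$

where we used that $N(T,d,\varepsilon)$ is decreasing in $\varepsilon$. □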

Remark 5.26. It is important to note that we always have $N(T,d,\varepsilon) = 1$ when
$\varepsilon \ge \mathrm{diam}(T)$, as in this case any singleton $N = \{t_0\}$ is trivially an $\varepsilon$-net. Thus
it suffices to take the integral in Corollary 5.25 only up to $\varepsilon = \mathrm{diam}(T)$.
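
As a numerical illustration of the chaining bound, the sketch below evaluates the dyadic sum of Theorem 5.24 on a finite index set and compares it with a Monte Carlo estimate of the expected supremum. Everything in it is an assumption made for illustration: the index set is a random point cloud in the unit square, the process is $X_t = \langle g, t\rangle$ with $g$ standard Gaussian (which is subgaussian for the Euclidean metric), and the nets are built greedily, so their sizes only upper bound the covering numbers. The function names are hypothetical.

```python
import math
import random


def greedy_net(points, eps):
    """Greedily pick a subset such that every point lies within eps of it.

    The result is an eps-net of `points`, though not necessarily a minimal one,
    so its size is only an upper bound on the covering number N(T, d, eps).
    """
    net = []
    for p in points:
        if all(math.dist(p, q) > eps for q in net):
            net.append(p)
    return net


def dudley_sum(points, k_min=-5, k_max=30):
    """Evaluate 6 * sum_k 2^{-k} sqrt(log |N_k|) using greedy nets N_k.

    Assumes diam(T) <= 2^{-k_min}; scales above the diameter contribute
    nothing because a single point is then a net. The sum is truncated at
    k_max, which drops only a negligible geometric tail.
    """
    total = 0.0
    for k in range(k_min, k_max + 1):
        eps = 2.0 ** (-k)
        total += eps * math.sqrt(math.log(len(greedy_net(points, eps))))
    return 6.0 * total


def expected_sup(points, trials, rng):
    """Monte Carlo estimate of E[sup_t X_t] for X_t = <g, t>, g ~ N(0, I_2)."""
    acc = 0.0
    for _ in range(trials):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        acc += max(g1 * x + g2 * y for x, y in points)
    return acc / trials


if __name__ == "__main__":
    rng = random.Random(0)
    T = [(rng.random(), rng.random()) for _ in range(200)]
    print("Monte Carlo E[sup X_t]:", expected_sup(T, 2000, rng))
    print("Dudley sum (greedy nets):", dudley_sum(T))
```

Because greedy nets are not minimal, the computed sum can only overestimate the bound of Theorem 5.24; the comparison therefore remains a valid (if loose) upper bound on the simulated supremum.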

Remark 5.27. The logarithm of the covering number $\log N(T,d,\varepsilon)$ is often
called metric entropy in analogy with information theory: it measures the
number of bits needed to specify an element of $T$ up to precision $\varepsilon$. It is
customary to refer to the integral in Corollary 5.25 as the entropy integral.

To illustrate Corollary 5.25, let us revisit Example 5.15.

Example 5.28 (Wasserstein law of large numbers revisited). We adopt the
same setting and notation as in Example 5.15. Recall that we want to estimate
the expected Wasserstein distance between the empirical and true measures

$$W_1(\mu_n,\mu) = \sup_{f\in\mathcal{F}} X_f,$$

where $X_1, X_2, \ldots$ are i.i.d. variables in $[0,1]$ with distribution $\mu$ and

$$X_f = \frac{1}{n}\sum_{k=1}^{n}\{f(X_k) - \mu f\}, \qquad \mathcal{F} = \{f \in \mathrm{Lip}([0,1]) : 0 \le f \le 1\}.$$

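By the Azuma-Hoeffding inequality (Corollary 3.9), we have

$$\mathbf{E}[e^{\lambda\{X_f - X_g\}}] \le e^{\lambda^2\|f-g\|_\infty^2/2n}.$$

The process $\{X_f\}_{f\in\mathcal{F}}$ is therefore subgaussian with respect to the metric
$d(f,g) = n^{-1/2}\|f-g\|_\infty$. We can consequently estimate using Corollary 5.25

$$\mathbf{E}[W_1(\mu_n,\mu)] \le 12\int_0^\infty \sqrt{\log N(\mathcal{F}, n^{-1/2}\|\cdot\|_\infty, \varepsilon)}\,d\varepsilon.$$

But it is easily seen that

$$N(\mathcal{F}, n^{-1/2}\|\cdot\|_\infty, \varepsilon) = N(\mathcal{F}, \|\cdot\|_\infty, n^{1/2}\varepsilon),$$

so that changing variables in the integral and using Lemma 5.16 yields

$$\mathbf{E}[W_1(\mu_n,\mu)] \le \frac{12}{\sqrt{n}}\int_0^\infty \sqrt{\log N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)}\,d\varepsilon \le \frac{12}{\sqrt{n}}\int_0^{1/2} \sqrt{\frac{c}{\varepsilon}}\,d\varepsilon.$$

As $\varepsilon^{-1/2}$ is integrable at the origin, we have proved

$$\mathbf{E}[W_1(\mu_n,\mu)] \lesssim n^{-1/2},$$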

which is a huge improvement over the $n^{-1/3}$ rate obtained by the crude
method used in Example 5.15. It is evident from the above computations that
the crucial improvement is due to the fact that $|X_f - X_g| \lesssim n^{-1/2}\|f-g\|_\infty$ in
probability (as is made precise by the subgaussian property), while the best
almost sure Lipschitz bound one can hope for is $|X_f - X_g| \lesssim \|f-g\|_\infty$.

In the present example, it is rather easy to obtain a matching lower bound
on the Wasserstein distance. Indeed, note that for any function $f \in \mathcal{F}$ that is
not constant $\mu$-a.s., we obtain by the central limit theorem

$$\mathbf{E}[W_1(\mu_n,\mu)] \ge \mathbf{E}[X_f \vee X_{1-f}] = \mathbf{E}|X_f| \sim n^{-1/2}.$$

Thus the rate we obtained by chaining is sharp in the present setting.
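
The $n^{-1/2}$ rate can be observed in a small simulation. The sketch below is an illustration under stated assumptions, not part of the example: it takes $\mu$ to be the uniform distribution on $[0,1]$ and uses the standard one-dimensional identity $W_1(\mu_n,\mu) = \int_0^1 |F_n(x) - x|\,dx$, where $F_n$ is the empirical distribution function. The function names are hypothetical.

```python
import math
import random


def abs_integral(c, a, b):
    """Integral of |c - x| over [a, b], assuming a <= b."""
    if c <= a:
        return ((b - c) ** 2 - (a - c) ** 2) / 2.0
    if c >= b:
        return ((c - a) ** 2 - (c - b) ** 2) / 2.0
    return ((c - a) ** 2 + (b - c) ** 2) / 2.0


def w1_to_uniform(sample):
    """W_1 between the empirical measure of `sample` and Uniform[0, 1].

    Computes the integral of |F_n(x) - x| exactly, piece by piece.
    """
    xs = [0.0] + sorted(sample) + [1.0]
    n = len(sample)
    # On [xs[i], xs[i+1]) the empirical distribution function equals i / n.
    return sum(abs_integral(i / n, xs[i], xs[i + 1]) for i in range(n + 1))


def mean_w1(n, trials, rng):
    """Monte Carlo estimate of E[W_1(mu_n, mu)] for sample size n."""
    total = 0.0
    for _ in range(trials):
        total += w1_to_uniform([rng.random() for _ in range(n)])
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 400, 1600):
        # If the rate is n^{-1/2}, the rescaled values should be roughly constant.
        print(n, math.sqrt(n) * mean_w1(n, 200, rng))
```

Quadrupling the sample size should roughly halve the estimated distance, in line with the sharp rate derived in the example.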

Now that we understand the chaining principle, we can use it to obtain
more sophisticated results. For example, just as we could obtain a tail bound
in Lemma 5.2 corresponding to the maximal inequality of Lemma 5.1, we can
obtain a tail bound counterpart to Corollary 5.25.


Theorem 5.29 (Chaining tail inequality). Let $\{X_t\}_{t\in T}$ be a separable
subgaussian process on the metric space $(T,d)$. Then for all $t_0 \in T$ and $x \ge 0$

$$\mathbf{P}\Big[\sup_{t\in T}\{X_t - X_{t_0}\} \ge C\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon + x\Big] \le C e^{-x^2/C\,\mathrm{diam}(T)^2},$$

where $C < \infty$ is a universal constant.


Proof. The beginning of the proof is identical to that of Theorem 5.24, and
we adopt the notations used there. As in Theorem 5.24, it is easily seen that it
suffices to consider $|T| < \infty$, as we will assume in the remainder of the proof.

The idea here is to run the chaining argument without taking the expectation.
As $|T| < \infty$, we have $\pi_n(t) = t$ for $n$ sufficiently large. Thus

$$X_t - X_{t_0} = \sum_{k>k_0}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\}$$

by the telescoping property of the sum. This elementary chaining identity lies
at the heart of the chaining argument. We immediately obtain

$$\sup_{t\in T}\{X_t - X_{t_0}\} \le \sum_{k>k_0}\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\}.$$

Rather than bounding the expectation of this quantity, as we did in Theorem
5.24, we will bound the tail behavior of every term in this sum. To this end,
note that the subgaussian property of $\{X_t\}_{t\in T}$ and Lemma 5.2 yield

$$\mathbf{P}\Big[\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\} \ge 6\times 2^{-k}\sqrt{\log |N_k|} + 3\times 2^{-k} z\Big] \le e^{-z^2/2}.$$


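Thus with high probability, every link $X_{\pi_k(t)} - X_{\pi_{k-1}(t)}$ at the scale $k$ is small.
We would like to show that all links at every scale are small simultaneously,
that is, that the probability of the union over all $k$ of the events in the above
bound is small. We can use a crude union bound to control the latter probability,
but it is clear that we must then choose $z = z_k$ to be increasing in $k$ in such a
way that the probabilities of the individual events are summable: that is,

$$\begin{aligned}
\mathbf{P}[\Omega] &:= \mathbf{P}\Big[\exists\, k > k_0 \text{ s.t. } \sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\} \ge 6\times 2^{-k}\sqrt{\log |N_k|} + 3\times 2^{-k} z_k\Big] \\
&\le \sum_{k>k_0} \mathbf{P}\Big[\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\} \ge 6\times 2^{-k}\sqrt{\log |N_k|} + 3\times 2^{-k} z_k\Big] \\
&\le \sum_{k>k_0} e^{-z_k^2/2}.
\end{aligned}$$

How to choose $z_k$ is not so important. An easy choice $z_k = x + \sqrt{k - k_0}$ yields

$$\mathbf{P}[\Omega] \le \sum_{k>k_0} e^{-z_k^2/2} \le e^{-x^2/2}\sum_{k>0} e^{-k/2} \le C e^{-x^2/2}.$$

Now note that on the event $\Omega^c$, we have

$$\begin{aligned}
\sup_{t\in T}\{X_t - X_{t_0}\}
  &\le \sum_{k>k_0}\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\} \\
  &\le 6\sum_{k>k_0} 2^{-k}\sqrt{\log |N_k|} + 3\times 2^{-k_0}\sum_{k>0} 2^{-k}\sqrt{k} + 3\times 2^{-k_0}\sum_{k>0} 2^{-k}\,x \\
  &\le C\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon + C\,\mathrm{diam}(T)\,x,
\end{aligned}$$

where we have used that $2^{-k_0} \le 2\,\mathrm{diam}(T)$ and

$$2^{-k_0} \le C\,2^{-k_0-1}\sqrt{\log N(T,d,2^{-k_0-1})} \le C\sum_{k>k_0} 2^{-k}\sqrt{\log |N_k|}$$

by the definition of $k_0$. Thus

$$\mathbf{P}\Big[\sup_{t\in T}\{X_t - X_{t_0}\} \ge C\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon + C\,\mathrm{diam}(T)\,x\Big] \le \mathbf{P}[\Omega],$$

and the proof is readily completed. □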
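
The summability step in the proof above rests on an elementary inequality, spelled out here as a short sketch. The index $m = k - k_0$ and the explicit numerical constant are introduced only for this computation.

```latex
% For x \ge 0 and m = k - k_0 \ge 1, the cross term 2x\sqrt{m} is nonnegative, so
\begin{align*}
z_k^2 = (x + \sqrt{m})^2 = x^2 + 2x\sqrt{m} + m \ge x^2 + m .
\end{align*}
% Consequently the union bound is controlled by a geometric series:
\begin{align*}
\sum_{k>k_0} e^{-z_k^2/2}
  \le e^{-x^2/2}\sum_{m\ge 1} e^{-m/2}
  = \frac{e^{-x^2/2}}{e^{1/2}-1}
  < 2\,e^{-x^2/2}.
\end{align*}
```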


Remark 5.30. Note that the result of Theorem 5.29 is reminiscent of a concentration
inequality. Indeed, if we could establish the concentration inequality

$$\mathbf{P}\Big[\sup_{t\in T}\{X_t - X_{t_0}\} \ge \mathbf{E}\Big[\sup_{t\in T}\{X_t - X_{t_0}\}\Big] + x\Big] \le C e^{-x^2/C\,\mathrm{diam}(T)^2},$$

then the conclusion of Theorem 5.29 would follow directly by combining this
inequality with the chaining bound of Corollary 5.25 for the expected supremum.
Despite the similarities, however, Theorem 5.29 should not be confused
with a concentration inequality. Its conclusion is both weaker and stronger:
weaker, because Theorem 5.29 cannot establish a deviation inequality from
the mean, but only from a particular upper bound on the mean; stronger,
because the subgaussian assumption of Theorem 5.29 is much weaker than
would be required to establish a concentration inequality.


The proof of Theorem 5.29 suggests that at its core, the chaining method
boils down to simultaneously controlling, using a union bound, the magnitude
of all the links $X_{\pi_k(t)} - X_{\pi_{k-1}(t)}$ in the chaining identity. We might therefore
expect that chaining yields sharp results if the links $\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\}_{t\in T,\,k>k_0}$
are “nearly independent” in some sense. This is not entirely implausible, as
two links are either far apart or are at a different scale. It turns out that the
chaining method that we have developed here yields sharp results in many
cases, but falls short in others. In the next chapter, we will see that the chaining
method can be further improved to adapt to the structure of the set $T$.
The resulting method, called the generic chaining, is so efficient that it captures
exactly (up to universal constants) the magnitude of the supremum of
Gaussian processes! Once this has been understood, we can truly conclude
that chaining is the “correct” way to think about the suprema of random processes.
Nonetheless, considering that we have ultimately used no idea more
sophisticated than the union bound, the remarkably far-reaching power of the
chaining method remains somewhat of a miracle to this author.


Problems 


5.9 (The entropy integral and sum). Show that

$$\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon \le \sum_{k\in\mathbb{Z}} 2^{-k}\sqrt{\log N(T,d,2^{-k})} \le 2\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon.$$

Thus nothing is lost, up to a constant factor, in expressing the chaining bound
as an integral rather than a sum, as we have done in Corollary 5.25.
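
Before attempting a proof of Problem 5.9, it can be reassuring to check the two-sided comparison numerically in one concrete case. The sketch below is only an illustration and proves nothing in general: it assumes $T = [0,1]$ with $d(s,t) = |s-t|$, for which the covering number is $\lceil 1/(2\varepsilon)\rceil$, and all function names are hypothetical.

```python
import math


def covering_number_interval(eps):
    """Covering number of T = [0, 1] with d(s, t) = |s - t|."""
    return max(1, math.ceil(1.0 / (2.0 * eps)))


def entropy_sum(k_min=-60, k_max=60):
    """Dyadic sum of 2^{-k} sqrt(log N(2^{-k})) over k_min <= k <= k_max."""
    total = 0.0
    for k in range(k_min, k_max + 1):
        eps = 2.0 ** (-k)
        total += eps * math.sqrt(math.log(covering_number_interval(eps)))
    return total


def entropy_integral(m_max=200_000):
    """Integral of sqrt(log N(eps)) over eps > 0, summed over the steps of N.

    N(eps) = m on [1/(2m), 1/(2(m-1))) for m >= 2, and N(eps) = 1 for
    eps >= 1/2, so each step contributes sqrt(log m) / (2 m (m - 1)).
    Truncating at m_max slightly underestimates the integral.
    """
    return sum(
        math.sqrt(math.log(m)) / (2.0 * m * (m - 1)) for m in range(2, m_max + 1)
    )


if __name__ == "__main__":
    integral = entropy_integral()
    dyadic = entropy_sum()
    print("integral:", integral)
    print("dyadic sum:", dyadic)
    print("sandwich holds:", integral <= dyadic <= 2.0 * integral)
```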

5.10 (Chaining with arbitrary tails). The chaining method is not restricted
to subgaussian processes: it can be developed analogously for processes
that are Lipschitz “in probability” in a more general sense.

Let $\{X_t\}_{t\in T}$ be a separable process with $\mathbf{E}[X_t] = 0$ and

$$\log \mathbf{E}[e^{\lambda\{X_t - X_s\}/d(t,s)}] \le \psi(\lambda) \quad\text{for all } t,s\in T,\ \lambda \ge 0,$$

where $\psi$ is as in Lemma 5.1. Show that

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \lesssim \int_0^\infty \psi^{*-1}(2\log N(T,d,\varepsilon))\,d\varepsilon.$$


5.11 (An improved chaining bound and Wasserstein LLN). The key
improvement of the chaining bound of Corollary 5.25 over the crude approximation
of Lemma 5.7 is that the former uses only an in probability Lipschitz
property, while the latter uses a stronger almost sure Lipschitz property. These
two ideas are not mutually exclusive, however: when the process $\{X_t\}_{t\in T}$ satisfies
both types of Lipschitz property, we can obtain an improved chaining
bound that is a sort of hybrid between Corollary 5.25 and Lemma 5.7.

a. Prove the following theorem.

Theorem 5.31 (Improved chaining). Let $\{X_t\}_{t\in T}$ be a separable process
that is both subgaussian (Definition 5.20) and almost surely Lipschitz
(Definition 5.6). Then we have the following estimate:

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \inf_{\delta>0}\Big\{2\delta\,\mathbf{E}[C] + 12\int_\delta^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon\Big\}.$$

Hint: run the chaining argument only up to the scale $2^{-n}$ and use the almost
sure Lipschitz property to estimate the remainder term.

To  understand  the  advantage  of  Theorem  5.31,  we  first  note  the  following. 

b. Show that $N(T,d,\varepsilon)$ diverges as $\varepsilon \downarrow 0$ whenever $|T| = \infty$.

As the covering number diverges, a nontrivial application of Corollary 5.25 requires
that this divergence is sufficiently slow that $\sqrt{\log N(T,d,\varepsilon)}$ is integrable
at zero. This is not always the case. On the other hand, Lemma 5.7 would
give a nontrivial bound even when $\sqrt{\log N(T,d,\varepsilon)}$ is not integrable, but
the use of the almost sure Lipschitz property yields a very pessimistic bound.
Theorem 5.31 provides the best of both worlds: it uses the “in probability”
Lipschitz property as much as possible, while using the almost sure Lipschitz
property to cut off the divergent part of the integral.

To  illustrate  the  efficiency  of  Theorem  5.31,  let  us  revisit  once  more  the 
Wasserstein  law  of  large  numbers.  We  have  resolved  completely  the  rate  of 
convergence  in  one  dimension  in  Example  5.28.  However,  in  higher  dimensions, 
we  have  so  far  only  obtained  pessimistic  rates  in  Problem  5.8. 

c. Show that we cannot obtain any nontrivial bound for the Wasserstein law
of large numbers in dimensions $d \ge 2$ from Corollary 5.25.

d. Using Theorem 5.31, show that in the setting of Problem 5.8

$$\mathbf{E}[W_1(\mu_n,\mu)] \lesssim \begin{cases} n^{-1/2} & \text{for } d = 1, \\ n^{-1/2}\log n & \text{for } d = 2, \\ n^{-1/d} & \text{for } d \ge 3. \end{cases}$$

Unlike in the one-dimensional case, a lower bound (and hence the sharpness
of the above estimates for the rates) is not immediately obvious in dimensions
$d \ge 2$. We must work a little bit harder to obtain some insight.

e. Suppose that $\mu(dx) = \rho(x)\,dx$ with $\|\rho\|_\infty < \infty$. Show that

$$\mathbf{E}\Big[\min_{i\le n}\|x - X_i\|\Big] \gtrsim n^{-1/d} \quad\text{for all } x \in [0,1]^d.$$

Hint: use $\mathbf{P}[\min_{i\le n}\|x - X_i\| \ge t] = \mathbf{P}[\|x - X_1\| \ge t]^n$ and integrate.

f. Conclude that when $\mu$ has a bounded density, we have in any dimension $d$

$$\mathbf{E}[W_1(\mu_n,\mu)] \gtrsim n^{-1/d}.$$

Hint: consider the (random) function $f(x) = -\min_{i\le n}\|x - X_i\|$.

Taking together all the upper and lower bounds that we have proved for the
Wasserstein law of large numbers, we have evidently obtained sharp rates
$\sim n^{-1/2}$ in dimension $d = 1$ and $\sim n^{-1/d}$ in dimension $d \ge 3$. The only case
still in question is dimension $d = 2$, where there remains a gap between our
lower and upper bounds $n^{-1/2} \lesssim \mathbf{E}[W_1(\mu_n,\mu)] \lesssim n^{-1/2}\log n$. It turns out
that neither bound is sharp in this case: the correct rate is $\sim n^{-1/2}(\log n)^{1/2}$.
It has been shown by Talagrand that this rather deep result, due to Ajtai,
Komlós, and Tusnády, can be derived (in a nontrivial manner) using the more
sophisticated generic chaining method that will be developed in Chapter 6.


5.4  Penalization  and  the  slicing  method 

Up to this point we have considered the suprema of subgaussian processes,
which are necessarily centered $\mathbf{E}[X_t] = 0$ (or at least $\mathbf{E}[X_t - X_s] = 0$ for all
$t,s$). It is often of interest, however, to consider random processes that have
nontrivial mean behavior $t \mapsto \mathbf{E}[X_t]$. To this end, let us decompose

$$X_t = \mathbf{E}[X_t] + Z_t$$

in terms of its mean $\mathbf{E}[X_t]$ and fluctuations $Z_t = X_t - \mathbf{E}[X_t]$. It is natural to
assume that the fluctuations $\{Z_t\}_{t\in T}$ form a subgaussian process. As

$$\sup_{t\in T} X_t = \sup_{t\in T}\{Z_t + \mathbf{E}[X_t]\},$$

the problem of controlling the supremum of $\{X_t\}_{t\in T}$ can evidently be interpreted
as the problem of controlling the penalized supremum of a subgaussian
process, where $\mathbf{E}[X_t]$ plays the role of the penalty. The chaining method is well
suited to controlling the fluctuations, but not to controlling the penalty. The
aim of this section is to develop a technique, called the slicing method, that
reduces the problem of controlling a penalized supremum of a subgaussian
process to controlling a subgaussian process without penalty. As penalized
suprema arise in many settings, the slicing method is an important part of
the toolbox needed to control the suprema of random processes.

There is, in fact, nothing special about the specific additive form of the
penalty: the slicing method will prove to be useful in other cases as well. For
example, in various situations it is of interest to control a weighted supremum

$$\sup_{t,s\in T}\frac{X_t - X_s}{\rho(t,s)}$$

of a subgaussian process $\{X_t\}_{t\in T}$ for some suitable function $\rho$ that should be
viewed as a multiplicative (rather than additive) penalty. One could of course
view $\tilde{X}_{t,s} = \{X_t - X_s\}/\rho(t,s)$ as a new stochastic process whose supremum
we wish to compute, but it is generally far from clear that this process is
subgaussian with respect to a natural distance. In such situations, the slicing
method will once again provide an important tool to handle the penalty.

Let us illustrate the basic idea behind the slicing method in the multiplicative
setting (the additive setting works much in the same way). Fix a
sequence $\alpha_k \downarrow 0$ such that $\rho(s,t) \le \alpha_0$ for all $s,t$. Then we can evidently write

$$\mathbf{P}\Big[\sup_{s,t\in T}\frac{X_t - X_s}{\rho(t,s)} \ge x\Big] = \mathbf{P}\Big[\sup_{k\ge 1}\ \sup_{\alpha_k<\rho(s,t)\le\alpha_{k-1}}\frac{X_t - X_s}{\rho(t,s)} \ge x\Big].$$

That is, we decompose the supremum over “slices” $\{(s,t) : \alpha_k < \rho(s,t) \le \alpha_{k-1}\}$
of the index set $T \times T$. The key point is that on each slice, the penalty
is controlled both from above and from below, so that it can be eliminated
from the supremum. We can therefore estimate, using a union bound,

$$\begin{aligned}
\mathbf{P}\Big[\sup_{s,t\in T}\frac{X_t - X_s}{\rho(t,s)} \ge x\Big]
  &\le \sum_{k=1}^{\infty}\mathbf{P}\Big[\sup_{\alpha_k<\rho(s,t)\le\alpha_{k-1}}\frac{X_t - X_s}{\rho(t,s)} \ge x\Big] \\
  &\le \sum_{k=1}^{\infty}\mathbf{P}\Big[\sup_{\rho(s,t)\le\alpha_{k-1}}\{X_t - X_s\} \ge \alpha_k x\Big].
\end{aligned}$$


Each probability inside the sum on the right-hand side is the tail of the supremum
of a subgaussian process without penalty. However, the penalty still
appears implicitly, as it determines the subset of the index set over which the
supremum is taken in each term in the sum. This subset is getting smaller
as $k$ increases, which will decrease the probability; at the same time, the
threshold $\alpha_k x$ also decreases, which will increase the probability. To be able
to control the weighted supremum, we must therefore balance these competing
forces: that is, the penalty must be chosen in such a way that the size of the
set $\{\rho(t,s) \le \alpha_{k-1}\}$ shrinks sufficiently rapidly as compared to the level $\alpha_k$ to
render the probabilities summable. This basic idea is common to all applications
of the slicing method: however, its successful implementation requires a
bit of tuning that is specific to the setting in which it is applied. Once the idea
has been understood in detail in one representative example, the application
of the slicing method in other situations is largely routine; several examples
will be encountered in the problems at the end of this chapter.

As a nontrivial illustration of the slicing method, we will presently develop
in detail a very useful general result on weighted suprema: we will control the
modulus of continuity of subgaussian processes. This result is of significant
interest in its own right, as it sheds new light on the meaning of the entropy
integral that appears in Corollary 5.25. An increasing function $\omega$ such that
$\omega(0) = 0$ is called a modulus of continuity for the random process $\{X_t\}_{t\in T}$
on the metric space $(T,d)$ if there is a random variable $K$ such that

$$X_t - X_s \le K\,\omega(d(t,s)) \quad\text{for all } t,s \in T.$$

Evidently the function $\omega$ controls the “degree of smoothness” of $t \mapsto X_t$. To
show that $\omega$ is a modulus of continuity, it clearly suffices to prove that

$$K = \sup_{t,s\in T}\frac{X_t - X_s}{\omega(d(t,s))} < \infty \quad\text{a.s.}$$

To  this  end,  we  will  prove  the  following  result. 

Theorem 5.32 (Modulus of continuity). Let $\{X_t\}_{t\in T}$ be a separable subgaussian
process on the metric space $(T,d)$. Assume that $N(T,d,\varepsilon) \ge (c/\varepsilon)^q$
for some constants $c,q > 0$ and all $\varepsilon > 0$. Then the function

$$\omega(\delta) = \int_0^\delta \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon$$

is a modulus of continuity for $\{X_t\}_{t\in T}$. In particular, we have

$$\mathbf{E}\Big[\sup_{t,s\in T}\frac{X_t - X_s}{\omega(d(t,s))}\Big] < \infty.$$

Theorem 5.32 provides us with new insight on the relevance of the entropy
integral in Corollary 5.25: the latter controls not only the magnitude of the
supremum of the process, but in fact even its degree of smoothness!

Remark 5.33. An explicit tail bound on the quantity $\sup_{t,s}\{X_t - X_s\}/\omega(d(t,s))$
can be read off from the proof of Theorem 5.32.

Remark 5.34. The technical condition $N(T,d,\varepsilon) \ge (c/\varepsilon)^q$ required by Theorem
5.32 is very mild: it states that the metric dimension of $(T,d)$ is nonzero
(cf. Remark 5.14). This is the case in almost all situations of practical interest.
Nonetheless, this condition proves to be purely technical, and it can be shown
that $\omega$ as defined in Theorem 5.32 is still a modulus of continuity for $\{X_t\}_{t\in T}$
even in the absence of the technical condition. The proof of this fact is in the
same spirit as that of Theorem 5.32, but requires a more delicate tuning of the
slicing and chaining method that does not provide much added insight. We
avoid the added complications by imposing the additional technical condition
in order to provide a clean illustration of the slicing method.

To control the terms that appear in the slicing method, we need a local
version of the chaining inequality of Theorem 5.29 where the supremum is
taken over $t,s \in T$ such that $\omega(d(t,s)) \le \alpha_k$. Such a local inequality, which is
very useful in its own right, can be derived rather easily from Theorem 5.29.

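Proposition 5.35 (Local chaining inequality). Let $\{X_t\}_{t\in T}$ be a separable
subgaussian process on the metric space $(T,d)$. Then for all $x,\delta \ge 0$

$$\mathbf{P}\Big[\sup_{\substack{t,s\in T \\ d(t,s)\le\delta}}\{X_t - X_s\} \ge C\int_0^\delta \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon + x\Big] \le C e^{-x^2/C\delta^2}.$$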


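Proof. Define the random process $\{\tilde{X}_{t,s}\}_{(t,s)\in\tilde{T}}$ as

$$\tilde{X}_{t,s} = X_t - X_s, \qquad \tilde{T} = \{(t,s) \in T\times T : d(t,s) \le \delta\}.$$

Using the subgaussian property of $\{X_t\}_{t\in T}$ and Cauchy-Schwarz, we estimate

$$\begin{aligned}
\mathbf{E}[e^{\lambda\{\tilde{X}_{t,s} - \tilde{X}_{u,v}\}}]
  &= \mathbf{E}[e^{\lambda\{X_t - X_u\}}\,e^{\lambda\{X_v - X_s\}}] \\
  &\le \mathbf{E}[e^{2\lambda\{X_t - X_u\}}]^{1/2}\,\mathbf{E}[e^{2\lambda\{X_v - X_s\}}]^{1/2} \\
  &\le e^{\lambda^2\{d(t,u)^2 + d(s,v)^2\}},
\end{aligned}$$

and by an entirely analogous argument

$$\mathbf{E}[e^{\lambda\{\tilde{X}_{t,s} - \tilde{X}_{u,v}\}}] \le \mathbf{E}[e^{2\lambda\{X_t - X_s\}}]^{1/2}\,\mathbf{E}[e^{2\lambda\{X_v - X_u\}}]^{1/2} \le e^{2\lambda^2\delta^2}.$$

If we define the metric $\tilde{d}$ on $\tilde{T}$ as

$$\tilde{d}((t,s),(u,v)) = 2^{1/2}\sqrt{d(t,u)^2 + d(s,v)^2} \wedge 2\delta,$$

we see that $\{\tilde{X}_{t,s}\}_{(t,s)\in\tilde{T}}$ is a subgaussian process on the metric space $(\tilde{T},\tilde{d})$.
As $\mathrm{diam}(\tilde{T}) \le 2\delta$ (and thus $N(\tilde{T},\tilde{d},\varepsilon) = 1$ for $\varepsilon \ge 2\delta$), we obtain

$$\mathbf{P}\Big[\sup_{(t,s)\in\tilde{T}} \tilde{X}_{t,s} \ge C\int_0^{2\delta}\sqrt{\log N(\tilde{T},\tilde{d},\varepsilon)}\,d\varepsilon + x\Big] \le C e^{-x^2/C\delta^2}$$

by Theorem 5.29. Note that if $N$ is an $\varepsilon$-net for $(T,d)$, then $N\times N$ is a $2\varepsilon$-net
for $(\tilde{T},\tilde{d})$. As $|N\times N| = |N|^2$, we obtain $N(\tilde{T},\tilde{d},2\varepsilon) \le N(T,d,\varepsilon)^2$. Thus

$$\int_0^{2\delta}\sqrt{\log N(\tilde{T},\tilde{d},\varepsilon)}\,d\varepsilon \le 2\sqrt{2}\int_0^{\delta}\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon,$$

and the proof is readily completed.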

We  can  now  complete  the  proof  of  Theorem  5.32. 

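Proof (Theorem 5.32). The slicing argument with $\alpha_k=\omega(\Delta 2^{-k})$ yields

$$
\mathbf{P}\Bigg[\sup_{s,t\in T}\frac{X_t-X_s}{\omega(d(t,s))}\ge x\Bigg]\le\sum_{k=1}^\infty\mathbf{P}\Bigg[\sup_{d(s,t)\le\Delta 2^{-k+1}}\{X_t-X_s\}\ge\omega(\Delta 2^{-k})\,x\Bigg],
$$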


where we define $\Delta=\mathrm{diam}(T)$ for simplicity. We would like to apply Proposition
5.35 to each term in the sum. The problem is that here the integral
$\omega(\Delta 2^{-k})$ goes only up to the scale $\Delta 2^{-k}$, while the supremum is taken up to
a larger scale $\Delta 2^{-k+1}$; in Proposition 5.35, the two scales must be the same.
To resolve this issue, note that as $\varepsilon\mapsto N(T,d,\varepsilon)$ is a decreasing function

$$
\int_\delta^{2\delta}\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon\le\int_0^{\delta}\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon
$$

for every $\delta>0$, so that in particular $\omega(2\delta)\le 2\omega(\delta)$. Therefore
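$$
\begin{aligned}
\mathbf{P}\Bigg[\sup_{s,t\in T}\frac{X_t-X_s}{\omega(d(t,s))}\ge 2(C+x)\Bigg]
&\le\sum_{k=1}^\infty\mathbf{P}\Bigg[\sup_{d(s,t)\le\Delta 2^{-k+1}}\{X_t-X_s\}\ge(C+x)\int_0^{\Delta 2^{-k+1}}\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon\Bigg]\\
&\le\sum_{k=1}^\infty C\exp\Bigg(-\frac{x^2}{C(\Delta 2^{-k+1})^2}\bigg(\int_0^{\Delta 2^{-k+1}}\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon\bigg)^2\Bigg)\\
&\le\sum_{k=1}^\infty Ce^{-x^2\log N(T,d,\Delta 2^{-k+1})/C},
\end{aligned}
$$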


where we have used Proposition 5.35 and that $\varepsilon\mapsto N(T,d,\varepsilon)$ is decreasing.
We now note that the technical assumption $N(T,d,\varepsilon)\ge(c/\varepsilon)^q$ implies that
$\log N(T,d,\Delta 2^{-k+1})$ grows at least linearly in $k$. Thus the above sum is a
geometric series, and we readily obtain an estimate of the form

$$
\mathbf{P}\Bigg[\sup_{s,t\in T}\frac{X_t-X_s}{\omega(d(t,s))}\ge 2C+x\Bigg]\le Ae^{-x^2/A}\quad\text{for all }x\ge 1,
$$


where  C  is  the  universal  constant  from  Proposition  5.35  and  A  is  a  constant 
that  depends  on  c,  q  only.  Integrating  the  tail  bound  yields  the  conclusion.  □ 

Remark 5.36. The proof of Theorem 5.32 highlights the competing demands
on our choice of slicing sequence $\alpha_k$. On the one hand, we want $\alpha_{k-1}$ and
$\alpha_k$ to be sufficiently close together that the scales at which the supremum
and the tail probability are evaluated are of the same order in each term in
the slicing argument. This requires that the sequence $\alpha_k$ converges not too
quickly. On the other hand, we want $\alpha_{k-1}$ and $\alpha_k$ to be sufficiently far apart
that the probabilities in the slicing bound are summable. This requires that the
sequence $\alpha_k$ converges not too slowly. In the proof of Theorem 5.32, we initially
chose a geometric sequence $\alpha_k=\omega(\Delta 2^{-k})$ to ensure that $\alpha_k\le\alpha_{k-1}\le 2\alpha_k$
are not too far apart; we subsequently imposed the technical condition on the
covering numbers to ensure that the probabilities are summable.

To  illustrate  Theorem  5.32,  let  us  prove  a  classical  result  in  stochastic 
analysis  due  to  P.  Levy  on  the  modulus  of  continuity  of  Brownian  motion. 

Example 5.37 (Modulus of continuity of Brownian motion). Let $\{B_t\}_{t\in[0,1]}$ be
standard Brownian motion. As $B_t-B_s$ is Gaussian, we compute exactly

$$
\mathbf{E}[e^{\lambda\{B_t-B_s\}}]=e^{\lambda^2|t-s|/2}.
$$

Thus $\{B_t\}_{t\in[0,1]}$ is subgaussian on $([0,1],d)$ with the metric $d(t,s)=\sqrt{|t-s|}$.
Moreover, by Lemma 5.13, we readily obtain the estimates

$$
\frac{1}{\varepsilon^2}\lesssim N([0,1],d,\varepsilon)=N([0,1],|\cdot|,\varepsilon^2)\lesssim\frac{1}{\varepsilon^2}
$$

for $\varepsilon\le 1$. Thus Theorem 5.32 states that

$$
|B_t-B_s|\lesssim\omega(\sqrt{|t-s|})\quad\text{for all }t,s\in[0,1]\text{ a.s.},
$$

where

$$
\omega(\delta)=\int_0^\delta\sqrt{\log\frac{1}{\varepsilon}}\,d\varepsilon\lesssim\delta\sqrt{\log\frac{1}{\delta}}.
$$

That is, the sample paths of Brownian motion are slightly less smooth than
Hölder-$\frac12$ by a logarithmic factor. It is easy to see that this result is sharp!
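Indeed, note that as Brownian motion has independent increments,

$$
\sup_{|t-s|\le\varepsilon}\frac{|B_t-B_s|}{\omega(\sqrt{|t-s|})}\ge\max_{n\le N}\frac{B_{n\varepsilon}-B_{(n-1)\varepsilon}}{\omega(\sqrt{\varepsilon})}\gtrsim\frac{\max_{n\le N}X_n}{\sqrt{\log N}},
$$

where $N=\varepsilon^{-1}$ and $X_n=\varepsilon^{-1/2}\{B_{n\varepsilon}-B_{(n-1)\varepsilon}\}$ are i.i.d. $\sim N(0,1)$. Thus

$$
\mathbf{E}\Bigg[\limsup_{|t-s|\downarrow 0}\frac{|B_t-B_s|}{\omega(\sqrt{|t-s|})}\Bigg]\gtrsim\limsup_{N\to\infty}\frac{\mathbf{E}[\max_{n\le N}X_n]}{\sqrt{\log N}}>0
$$

by Problem 5.1, so the modulus of continuity $\omega(\sqrt{|t-s|})$ is evidently sharp.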
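
The scaling in Example 5.37 can be probed numerically. The following is a minimal sketch, assuming a random-walk discretization of Brownian motion on a grid of $n$ points; the helper name `modulus_ratio`, the grid size and the chosen lags are illustrative assumptions and not part of the notes. For each lag $h$ it computes the largest increment over that lag divided by $\sqrt{h\log(1/h)}$. If the logarithmic correction is the right one, these ratios should remain of order one as $h$ decreases.

```python
import numpy as np

def modulus_ratio(path, lag, n):
    """Largest increment over a fixed lag, normalised by sqrt(h log(1/h))."""
    h = lag / n                                   # time lag |t - s|
    incr = np.abs(path[lag:] - path[:-lag])       # |B_{t+h} - B_t| on the grid
    return incr.max() / np.sqrt(h * np.log(1.0 / h))

rng = np.random.default_rng(0)
n = 2**16                                         # grid points on [0, 1]
steps = rng.normal(scale=np.sqrt(1.0 / n), size=n)
path = np.concatenate([[0.0], np.cumsum(steps)])  # B_0 = 0 and B_{k/n}, k = 1..n

lags = [2**j for j in range(0, 12, 2)]            # lags 1, 4, ..., 1024 grid steps
ratios = [modulus_ratio(path, lag, n) for lag in lags]
for lag, r in zip(lags, ratios):
    print(f"h = {lag / n:.2e}   ratio = {r:.3f}")
```

A single simulated path on a finite grid can only suggest the behavior of the supremum; it does not replace the argument above.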


Problems 

5.12  (Empirical  risk  minimization  I:  slicing).  Empirical  risk  minimiza¬ 
tion  is  a  simple  but  fundamental  idea  that  arises  throughout  machine  learning, 
statistics (where it is often called M-estimation), and stochastic programming
(where it is called sample average approximation). The basic problem can be
phrased as follows. Let $(T,d)$ be a metric space, and consider a given family
of functions $\{f_t : t\in T\}$ on some probability space $(\mathbb{X},\mu)$. We define the risk
of $t\in T$ as $R(t):=\mu f_t$. Our goal is to select $t^*\in T$ that minimizes the risk:

$$
t^*:=\mathop{\mathrm{argmin}}_{t\in T}R(t)=\mathop{\mathrm{argmin}}_{t\in T}\mu f_t.
$$

However, it may be impossible to do this directly: either because the measure
$\mu$ is unknown (in machine learning and statistics), or because computing
integrals with respect to $\mu$ is intractable (in stochastic programming). Instead,
we assume that we have access to $n$ i.i.d. samples $X_1,\ldots,X_n\sim\mu$. By the law
of large numbers, the risk should be well approximated by the empirical risk

$$
R(t)\approx\mu_nf_t:=\frac{1}{n}\sum_{k=1}^nf_t(X_k)
$$

when the sample size $n$ is large. The empirical risk minimizer

$$
\hat t_n:=\mathop{\mathrm{argmin}}_{t\in T}\mu_nf_t
$$

should therefore be a good approximation of the optimum $t^*$. We would like
to find out how good of an approximation this is: that is, we would like to
bound the excess risk $R(\hat t_n)-R(t^*)$ of the empirical risk minimizer.
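
To make the setting concrete, here is a minimal sketch of empirical risk minimization in a toy case. Everything specific in it is an assumption chosen for illustration: the loss $f_t(x)=(x-t)^2$, the measure $\mu=\mathrm{Uniform}[0,1]$ (so that $R(t)=\frac{1}{12}+(t-\frac12)^2$ and $t^*=\frac12$), the finite grid standing in for $T$, and the names `empirical_risk_minimizer` and `risk`.

```python
import numpy as np

def empirical_risk_minimizer(samples, grid):
    """Return the grid point t minimising the empirical risk mu_n f_t
    for the illustrative loss f_t(x) = (x - t)^2."""
    # rows: candidate parameters t, columns: samples X_k
    emp_risk = ((samples[None, :] - grid[:, None]) ** 2).mean(axis=1)
    return grid[np.argmin(emp_risk)]

def risk(t):
    """Exact risk R(t) = E (X - t)^2 for X ~ Uniform[0, 1]."""
    return 1.0 / 12.0 + (t - 0.5) ** 2

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 201)        # finite stand-in for the index set T
t_star = 0.5                             # minimiser of the exact risk

excess = {}
for n in (100, 10_000):
    samples = rng.uniform(size=n)
    t_hat = empirical_risk_minimizer(samples, grid)
    excess[n] = risk(t_hat) - risk(t_star)   # excess risk R(t_hat) - R(t*)
    print(n, t_hat, excess[n])
```

In this toy case the excess risk is simply $(\hat t_n-t^*)^2$, so it is nonnegative and shrinks as the sample size grows.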
a. Argue that

$$
\mathbf{P}[R(\hat t_n)-R(t^*)\ge\delta]\le\mathbf{P}\Bigg[\sup_{t\in T:\,R(t)-R(t^*)\ge\delta}\mu_n(f_{t^*}-f_t)\ge 0\Bigg].
$$

Hint: use that $\mu_n(f_{t^*}-f_{\hat t_n})\ge 0$ by construction.

b. Define the random process $X_t:=\mu_n(f_{t^*}-f_t)$. Note that $X_t$ is not centered,
so that we cannot apply chaining directly. However, show that

$$
Z_t:=n^{1/2}\{X_t+R(t)-R(t^*)\}
$$

is subgaussian on $(T,d)$ with the metric $d(t,s):=\|f_t-f_s\|_\infty$.


c. Use the slicing argument to show that

$$
\mathbf{P}[R(\hat t_n)-R(t^*)\ge\delta]\le\sum_{k=1}^\infty\mathbf{P}\Bigg[\sup_{t\in T:\,R(t)-R(t^*)\le\delta 2^k}Z_t\ge\delta 2^{k-1}n^{1/2}\Bigg].
$$

d. The bound we have obtained already suffices to obtain a crude upper bound
on the magnitude of the excess risk: show that if

$$
\int_0^\infty\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon<\infty,
$$

then we have

$$
R(\hat t_n)-R(t^*)=O_P(n^{-1/2}).
$$

Hint: set $\delta=n^{-1/2}(K+x)$ for a sufficiently large constant $K$, and replace
the supremum in the slicing bound by the supremum over the entire set $T$.

The above bound on the excess risk is exceedingly pessimistic. Indeed, if we
set $\delta=Kn^{-1/2}$, then the suprema in the slicing bound are taken over the
sets $T_{k,n}=\{t\in T : R(t)-R(t^*)\le K2^kn^{-1/2}\}$ which shrink rapidly as $n$
increases. Thus these suprema should be much smaller than is captured by our
crude estimate on the excess risk, where we have entirely ignored this effect.
However, we cannot obtain more precise rates unless we are able to control
the sizes of the sets $T_{k,n}$, and this requires us to impose a suitable assumption on
the risk $R(t)$. To this end, it is common to assume that a margin condition

$$
R(t)-R(t^*)\ge(d(t,t^*)/c_1)^\alpha\quad\text{for all }t\in T
$$

holds for some constants $c_1>0$ and $\alpha>1$.

e. Assume that the margin condition holds and that

$$
\int_0^\delta\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon\le c_2\delta^\beta
$$

for some $c_2>0$ and $0<\beta<1$. Show that

$$
R(\hat t_n)-R(t^*)=O_P(n^{-\alpha/2(\alpha-\beta)}).
$$

Hint: choose $\delta=c_3n^{-\alpha/2(\alpha-\beta)}$ in the slicing bound for a sufficiently large
constant $c_3$ (depending on $c_1,c_2,\alpha,\beta$). Then we can estimate

$$
C\int_0^{c_1\delta^{1/\alpha}2^{k/\alpha}}\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon\le\delta 2^{k-2}n^{1/2},
$$

and thus it is possible to apply Proposition 5.35.

Remark 5.38. The bounds obtained in the previous problem are often
unsatisfactory in practice. The reason is that we have endowed $T$ with the uniform
norm $d(t,s):=\|f_t-f_s\|_\infty$, which is too stringent in most applications: it is
difficult both to satisfy the margin condition and to control the covering numbers
for such a strong norm. The uniform norm is the best we can hope for,
however, if we use only the subgaussian property of $\{Z_t\}_{t\in T}$ (Azuma-Hoeffding).
Later in this course, we will develop new tools from empirical process theory
that make it possible to obtain uniform bounds on the supremum of empirical
averages $\mu_nf-\mu f$ under much weaker norms. With this machinery in place,
however, the slicing argument will go through precisely as we used it above.

5.13 (Empirical risk minimization II: modulus of continuity). The
goal of this problem is to outline an alternative proof of the results obtained
in the previous problem: rather than employing the slicing argument directly,
we will deduce the bound on the excess risk from the modulus of continuity of
the process $\{Z_t\}_{t\in T}$. This is not really different, of course, as one must still use
slicing (in the form of Theorem 5.32) to control the modulus of continuity. The
main point of the present problem, however, is to emphasize that the modulus
of continuity arises naturally in the empirical risk minimization problems.

In  the  sequel,  we  work  in  the  same  setting  as  in  the  previous  problem. 

a. Show that

$$
R(\hat t_n)-R(t^*)\le\mu_n(f_{t^*}-f_{\hat t_n})-\mu(f_{t^*}-f_{\hat t_n})=n^{-1/2}Z_{\hat t_n}.
$$

Hint: use that $\mu_n(f_{t^*}-f_{\hat t_n})\ge 0$ by construction.

b. Show directly (without slicing) that if

$$
\int_0^\infty\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon<\infty,
$$

then we have

$$
\mathbf{E}[R(\hat t_n)-R(t^*)]\lesssim n^{-1/2}.
$$

c. The reason that the above bound is pessimistic is that $\hat t_n\to t^*$, so we expect
that $Z_{\hat t_n}-Z_{t^*}\ll\sup_{t\in T}\{Z_t-Z_{t^*}\}$. To capture this behavior, suppose that
$\omega(\delta)=\delta^\beta$ is a modulus of continuity for $\{Z_t\}_{t\in T}$, so $Z_{\hat t_n}-Z_{t^*}\lesssim d(\hat t_n,t^*)^\beta$
a.s. If in addition the margin condition holds, show that this implies

$$
R(\hat t_n)-R(t^*)\lesssim n^{-\alpha/2(\alpha-\beta)}\quad\text{a.s.}
$$

d. Deduce the conclusion of the previous problem from the off-the-shelf
modulus of continuity result obtained in Theorem 5.32.

5.14  (Law  of  iterated  logarithm).  A  classical  application  of  the  slicing 
method  in  probability  theory  is  the  proof  of  the  law  of  iterated  logarithm.  In 
this  problem,  we  will  prove  the  simplest  form  of  such  a  result. 

Let $X_1,X_2,\ldots$ be i.i.d. Gaussian random variables with zero mean and
unit variance. We aim to show the law of iterated logarithm

$$
\limsup_{n\to\infty}\frac{1}{\sqrt{2n\log\log n}}\sum_{k=1}^nX_k\le 1\quad\text{a.s.}
$$

(in fact, with a bit more work one can prove that equality holds a.s.)

a. Use the slicing method to show that for $\beta>1$ and $m\in\mathbb{N}$

$$
\mathbf{P}\Bigg[\sup_{n\ge\beta^m}\frac{1}{\sqrt{2n\log\log n}}\sum_{k=1}^nX_k\ge x\Bigg]\le\sum_{\ell=m}^\infty\mathbf{P}\Bigg[\max_{n\le\beta^{\ell+1}}\sum_{k=1}^nX_k\ge x\sqrt{2\beta^\ell\{\log\ell+\log\log\beta\}}\Bigg].
$$

b. Prove the following maximal inequality:

$$
\mathbf{P}\Bigg[\sup_{n\le N}\sum_{k=1}^nX_k\ge x\Bigg]\le e^{-x^2/2N}.
$$

Hint: without the sup, this is the Chernoff bound for Gaussian variables.
Now note that $M_n=\sum_{k=1}^nX_k$ is a martingale, so $e^{\lambda M_n}$ is a submartingale.
Improve the Chernoff bound using Doob's submartingale inequality.


c. Show that whenever $x^2>\beta$

$$
\lim_{m\to\infty}\mathbf{P}\Bigg[\sup_{n\ge\beta^m}\frac{1}{\sqrt{2n\log\log n}}\sum_{k=1}^nX_k\ge x\Bigg]=0,
$$

and conclude the form of the law of iterated logarithm stated above.
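
The maximal inequality of part b can be sanity-checked by simulation before attempting the proof. The sketch below is only a Monte Carlo illustration under assumed parameters (the horizon `N`, the number of `trials` and the thresholds `x` are arbitrary choices); it compares the empirical frequency of the event $\{\max_{n\le N}M_n\ge x\}$ with the bound $e^{-x^2/2N}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials = 200, 20_000
# Each row is one realisation of the partial sums M_1, ..., M_N.
walks = np.cumsum(rng.standard_normal((trials, N)), axis=1)
running_max = walks.max(axis=1)                   # max_{n <= N} M_n

results = []
for x in (10.0, 20.0, 30.0):
    empirical = (running_max >= x).mean()         # Monte Carlo frequency
    bound = np.exp(-x**2 / (2 * N))               # claimed maximal inequality
    results.append((x, empirical, bound))
    print(f"x = {x:5.1f}   empirical = {empirical:.4f}   bound = {bound:.4f}")
```

The simulation is consistent with the inequality but of course proves nothing; the point of the hint is that Doob's inequality gives the bound for the maximum at no extra cost compared with the Chernoff bound for $M_N$ alone.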
5.15 (Maxima of independent Gaussians). Let $\{X_n\}_{n\ge 0}$ be i.i.d. $N(0,1)$
random variables. Of course, it is trivially seen that $\sup_nX_n=\infty$ a.s., so there
is nothing interesting to be said about the supremum of the process $\{X_n\}_{n\ge 0}$
itself. However, even when the supremum of a process is infinite, the penalized
supremum can still be finite if the penalty is chosen appropriately.

a. Let $a_n\uparrow\infty$. Show that $\sup_nX_n/a_n<\infty$ if and only if $a_n\gtrsim\sqrt{\log n}$.

b. Let $b_n\uparrow\infty$. Show that $\sup_n\{X_n-b_n\}<\infty$ if and only if $b_n\gtrsim\sqrt{\log n}$.


Notes 

§5.1.  The  use  of  union  bounds  to  estimate  maxima  of  independent  random 
variables  is  classical.  The  proof  of  Lemma  5.1  arises  naturally  from  the  devel¬ 
opment  of  maximal  inequalities  in  terms  of  Orlicz  norms,  cf.  [66];  the  present 
formulation  is  taken  from  [13].  Orlicz  norms  make  it  possible  to  define  bona 
fide  Banach  spaces  of  random  variables  with  given  tail  behavior,  and  are 
therefore  particularly  useful  in  a  functional-analytic  setting.  The  Johnson- 
Lindenstrauss  lemma  (Problem  5.3)  can  be  found,  for  example,  in  [56]. 

§5.2.  Covering  and  packing  numbers  were  first  studied  systematically  in  the 
beautiful  paper  of  Kolmogorov  and  Tikhomirov  [47],  which  remains  surpris¬ 
ingly  modern.  The  covering  number  estimates  of  finite-dimensional  balls  and 
of  Lipschitz  functions  are  already  obtained  there.  The  application  of  Lemma 
5.7  is  often  referred  to  as  “an  e-net  argument”;  it  is  the  simplest  and  most 
classical  method  to  bound  the  supremum  of  a  random  process.  Much  more  on 
estimating  the  norm  of  a  random  matrix  can  be  found  in  [95] . 

§5.3.  The  chaining  method  appears  in  any  first  course  on  stochastic  processes 
in  the  form  of  the  Kolmogorov  continuity  theorem  [46,  Theorem  2.2.8].  It  was 
developed  by  Kolmogorov  in  1934  but  apparently  never  published  by  him  (see 
[19]).  The  general  formulation  for  (sub)gaussian  processes  in  terms  of  covering 
numbers  is  due  to  Dudley  [27] .  A  method  of  chaining  using  Orlicz  norms  due  to 
Pisier  [66]  has  become  popular  as  it  yields  tail  bounds  without  any  additional 
effort.  The  tail  bound  of  Theorem  5.29  (whose  proof  was  inspired  by  [96])  is 
much  sharper,  however,  and  we  have  therefore  avoided  chaining  with  Orlicz 
norms.  A  different  approach  to  deriving  sharp  chaining  tail  bounds  can  be 
found  in  [51,  section  11.1].  The  sharp  rates  of  convergence  for  the  Wasserstein 
LLN  stated  in  Problem  5.11  can  be  found  in  [3]  (see  also  [88]). 

§5.4.  The  idea  behind  the  slicing  (also  known  as  peeling  or  stratification) 
method  already  arises  in  the  classical  proof  of  the  law  of  iterated  logarithm 
(Problem  5.14)  and  has  a  long  history  of  applications  to  empirical  processes. 
Theorem  5.32  appears,  without  the  additional  technical  condition,  in  [28]. 
Problems  5.12  and  5.13  only  give  a  flavor  of  numerous  applications  of  these 
ideas  in  mathematical  statistics;  see  [38,  37]  for  much  more  on  this  topic. 
Gaussian processes


In  the  previous  chapter,  we  developed  the  chaining  method  to  bound  the 
suprema  of  subgaussian  processes.  This  provides  a  powerful  tool  that  is  useful 
in  many  applications.  However,  at  this  point  in  the  course,  it  is  not  entirely 
clear  why  this  method  is  so  effective:  at  first  sight  the  method  appears  quite 
crude,  being  at  its  core  little  more  than  a  conveniently  organized  union  bound. 
It  is  therefore  a  remarkable  fact  that  some  form  of  the  chaining  method  suffices 
in  many  situations  (in  some  cases  in  a  more  sophisticated  form  than  was 
developed  in  the  previous  chapter)  to  obtain  sharp  results. 

To understand when the chaining method is sharp, we must supplement
our chaining upper bounds with corresponding lower bounds. It is clear
that  we  cannot  expect  to  obtain  sharp  lower  bounds  at  the  level  of  generality 
of  subgaussian  processes;  even  in  the  case  of  finite  maxima,  we  have  seen  that 
we  need  the  additional  assumption  of  independence  to  obtain  lower  bounds. 
In  the  case  of  general  suprema,  a  more  specific  structure  is  needed.  In  this 
chapter  we  will  investigate  the  case  of  Gaussian  processes,  for  which  a  very 
precise  understanding  of  these  questions  can  be  obtained. 

Definition 6.1 (Gaussian process). The random process $\{X_t\}_{t\in T}$ is called
a (centered) Gaussian process if the random variables $\{X_{t_1},\ldots,X_{t_n}\}$ are
centered and jointly Gaussian for all $n\ge 1$, $t_1,\ldots,t_n\in T$.

There  are  several  reasons  to  concentrate  on  Gaussian  processes: 

1.  Gaussian  processes  arise  naturally  in  many  important  applications,  both 
explicitly  and  implicitly  as  a  mathematical  tool  in  proofs. 

2.  Gaussian  processes  provide  us  with  the  simplest  prototypical  setting  in 
which  to  investigate  and  understand  chaining  lower  bounds. 

3.  Our  investigation  of  Gaussian  processes  will  give  rise  to  new  ideas  and 
methods  that  are  applicable  far  beyond  the  Gaussian  setting. 

Remark  6.2.  In  the  sequel,  all  Gaussian  processes  will  be  assumed  to  be  cen¬ 
tered  (that  is,  E[Xt]  =  0)  unless  stated  otherwise.  Some  methods  to  deal  with 
non-centered  processes  were  discussed  in  section  5.4. 
Let us remark at the outset that for a Gaussian process $\{X_t\}_{t\in T}$, we have

$$
\mathbf{E}[e^{\lambda\{X_t-X_s\}}]=e^{\lambda^2\mathbf{E}[|X_t-X_s|^2]/2}.
$$

Thus a Gaussian process determines a canonical metric on the index set $T$.

Definition 6.3 (Natural distance). A Gaussian process $\{X_t\}_{t\in T}$ is
subgaussian on $(T,d)$ for the natural distance $d(t,s):=\mathbf{E}[|X_t-X_s|^2]^{1/2}$.

Gaussian processes $\{X_t\}_{t\in T}$ will always be considered as being defined on
(T,  d)  endowed  with  the  natural  distance  d.  As  we  will  see  in  the  sequel,  the 
magnitude  of  the  suprema  of  Gaussian  processes  can  be  understood  completely 
(up  to  universal  constants)  in  terms  of  chaining  under  the  natural  distance. 
Once  this  has  been  understood,  we  can  truly  conclude  that  chaining  is  the 
“right”  way  to  think  about  the  suprema  of  random  processes. 


6.1  Comparison  inequalities 


How can we obtain a lower bound on the expected supremum of a Gaussian
process? The simplest possible situation is one that was already developed
in Problem 5.1: if $X_1,\ldots,X_n$ are i.i.d. Gaussians, the maximal inequalities
of  section  5.1  are  sharp.  As  this  elementary  fact  will  form  the  basis  for  all 
further  developments,  let  us  begin  by  giving  a  complete  proof. 

Lemma 6.4. If $X_1,\ldots,X_n$ are i.i.d. $N(0,\sigma^2)$ random variables, then

$$
c\sigma\sqrt{\log n}\le\mathbf{E}\Big[\max_{i\le n}X_i\Big]\le\sigma\sqrt{2\log n}
$$

for a universal constant $c$.


Proof. The upper bound follows immediately from Lemma 5.1 (and does not
require independence). To prove the lower bound, note that for any $\delta>0$

$$
\begin{aligned}
\mathbf{E}\Big[\max_{i\le n}X_i\Big]
&=\int_0^\infty\mathbf{P}\Big[\max_{i\le n}X_i\ge t\Big]\,dt+\mathbf{E}\Big[\max_{i\le n}X_i\wedge 0\Big]\\
&\ge\delta\,\mathbf{P}\Big[\max_{i\le n}X_i\ge\delta\Big]+\mathbf{E}[X_1\wedge 0]\\
&=\delta\{1-(1-\mathbf{P}[X_1\ge\delta])^n\}+\mathbf{E}[X_1\wedge 0],
\end{aligned}
$$

as $\mathbf{P}[\max_{i\le n}X_i\ge t]$ is decreasing in $t$ and as $\{X_i\}$ are i.i.d. Now note that

$$
\mathbf{P}[X_1\ge\delta]=\int_\delta^\infty\frac{e^{-x^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}\,dx\ge\frac{e^{-\delta^2/\sigma^2}}{c_1}
$$

for a universal constant $c_1$, where we used $x^2=(x-\delta+\delta)^2\le 2(x-\delta)^2+2\delta^2$.
Thus if we choose the parameter $\delta$ as

$$
\delta=\sigma\sqrt{\log n}-\sigma\sqrt{\log c_1},
$$

we have $\mathbf{P}[X_1\ge\delta]\ge 1/n$. This implies

$$
\mathbf{E}\Big[\max_{i\le n}X_i\Big]\ge(1-e^{-1})\sigma\sqrt{\log n}-c_2\sigma
$$

for a universal constant $c_2$. Thus the result follows when $n\ge e^{4c_2^2/(1-e^{-1})^2}$. On
the other hand, as there are only a finite number of values $n<e^{4c_2^2/(1-e^{-1})^2}$,
the lower bound trivially holds with some universal constant in this case. □
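
The two-sided bound of Lemma 6.4 is easy to observe in simulation. The following sketch (with $\sigma=1$; the sample sizes and the number of `trials` are arbitrary assumptions) estimates $\mathbf{E}[\max_{i\le n}X_i]$ by Monte Carlo and reports its ratio to the upper bound $\sqrt{2\log n}$. According to the lemma, this ratio should be at most one and bounded below by a positive constant.

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 2_000
ratios = {}
for n in (10, 100, 1_000):
    # Each row holds n i.i.d. N(0,1) variables; take the row-wise maximum.
    maxima = rng.standard_normal((trials, n)).max(axis=1)
    ratios[n] = maxima.mean() / np.sqrt(2 * np.log(n))
    print(n, round(ratios[n], 3))
```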

Let $\{X_t\}_{t\in T}$ be a random process on a general index set $T$. The intuition
behind the upper bounds developed in the previous chapter was that while
$X_t$ and $X_s$ will be strongly dependent when $t$ and $s$ are close together, $X_t$ and
$X_s$ can be nearly independent when $t$ and $s$ are far apart. This motivated the
approximation  of  the  supremum  by  finite  maxima  over  well  separated  points, 
for  which  the  result  of  Lemma  5.1  might  reasonably  be  expected  to  be  sharp. 
However,  we  never  actually  used  any  form  of  independence  in  the  proofs:  our 
upper  bounds  still  work  even  if  the  intuition  fails.  On  the  other  hand,  we 
can  only  expect  these  bounds  to  be  sharp  if  the  intuition  does  in  fact  hold. 
The  first  challenge  that  we  face  in  proving  lower  bounds  is  therefore  to  make 
mathematical  sense  of  the  above  intuition  that  was  only  used  as  a  guiding 
heuristic  for  obtaining  upper  bounds  in  the  previous  chapter.  This  is  precisely 
what  will  be  done  in  this  section  in  the  setting  of  Gaussian  processes. 

What should such a result look like? Let $N$ be a maximal $\varepsilon$-packing of $T$.
If $\{X_t : t\in N\}$ behave in some sense like independent Gaussians, then we
would expect by Lemma 6.4 that $\mathbf{E}[\sup_{t\in T}X_t]\ge\mathbf{E}[\max_{t\in N}X_t]\gtrsim\varepsilon\sqrt{\log|N|}$.
In view of the duality between packing and covering numbers (Lemma 5.12),
this is precisely the content of the following result.

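Theorem 6.5 (Sudakov). For a Gaussian process $\{X_t\}_{t\in T}$, we have

$$
\mathbf{E}\Big[\sup_{t\in T}X_t\Big]\ge c\,\sup_{\varepsilon>0}\varepsilon\sqrt{\log N(T,d,\varepsilon)}
$$

for a universal constant $c$.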

Remark 6.6. Combining Sudakov's lower bound with the upper bound
obtained in the previous chapter by chaining, we have evidently shown that

$$
\sup_{\varepsilon>0}\varepsilon\sqrt{\log N(T,d,\varepsilon)}\lesssim\mathbf{E}\Big[\sup_{t\in T}X_t\Big]\lesssim\int_0^\infty\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon,
$$

or, equivalently up to universal constants,

$$
\sup_{k\in\mathbb{Z}}2^{-k}\sqrt{\log N(T,d,2^{-k})}\lesssim\mathbf{E}\Big[\sup_{t\in T}X_t\Big]\lesssim\sum_{k\in\mathbb{Z}}2^{-k}\sqrt{\log N(T,d,2^{-k})}.
$$

Thus the upper bound and the lower bound we have obtained contain
precisely the same terms at every scale; however, the upper bound is a multiscale
bound (a sum over all scales), while the lower bound is a single scale bound
(a maximum over all scales). These two bounds are not as far apart as may
appear at first sight: in many situations the terms $2^{-k}\sqrt{\log N(T,d,2^{-k})}$
behave like a geometric series, so that their sum is of the same order as the
largest term. There are also many cases, however, where there is indeed a
gap between these two bounds. The main objective in the remainder of this
chapter will be to close the gap between these upper and lower bounds.
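
The difference between a single scale bound and a multiscale bound can be seen by evaluating the terms $2^{-k}\sqrt{\log N(T,d,2^{-k})}$ for two hypothetical covering number profiles. Both profiles in the sketch below are assumptions made purely for illustration, as are the truncation to 40 scales and the helper name `scale_terms`; no claim is made that either profile comes from a particular process. With polynomial covering numbers the terms decay geometrically and the sum is within a constant factor of the maximum, while for $\log N(\varepsilon)=\varepsilon^{-2}$ every scale contributes equally and the sum grows with the number of scales.

```python
import numpy as np

def scale_terms(log_cov, ks):
    """Terms 2^{-k} sqrt(log N(T, d, 2^{-k})) for the scales k in ks."""
    return np.array([2.0**-k * np.sqrt(log_cov(2.0**-k)) for k in ks])

ks = range(1, 41)                       # scales 2^{-1}, ..., 2^{-40}

# Hypothetical polynomial covering numbers N(eps) = (1/eps)^q for eps < 1.
q = 3
poly = scale_terms(lambda eps: q * np.log(1.0 / eps), ks)

# Hypothetical covering numbers with log N(eps) = eps^{-2}: each scale
# contributes the same amount, so the sum counts the number of scales.
flat = scale_terms(lambda eps: eps**-2.0, ks)

ratio_poly = poly.sum() / poly.max()    # multiscale sum versus largest term
ratio_flat = flat.sum() / flat.max()
print(ratio_poly, ratio_flat)
```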

Remark  6.7.  We  have  phrased  Theorem  6.5  in  terms  of  the  covering  numbers 
N(T,d,e)  to  bring  out  the  similarity  between  the  upper  and  lower  bounds. 
It  should  be  emphasized,  however,  that  upper  and  lower  bounds  require  in 
principle  fundamentally  different  ingredients.  Upper  bounds,  which  require 
approximation  of  every  point  in  the  index  set  T,  are  naturally  obtained  in 
terms  of  a  covering  of  T.  On  the  other  hand,  lower  bounds,  which  require  a 
subset  of  T  that  is  well  separated,  are  naturally  obtained  in  terms  of  a  packing 
of $T$ (indeed, it is in fact the packing number $D(T,d,\varepsilon)$ and not the covering
number that arises in the proof of Theorem 6.5). The duality of packing and
covering, while somewhat hidden in the statement of our results, therefore lies
at the heart of the development of matching upper and lower bounds. While
the duality between packing and covering numbers (Lemma 5.12) is elementary,
the  development  of  a  more  sophisticated  form  of  this  duality  will  prove  to  be 
one  of  the  challenges  that  we  must  surmount  in  our  quest  to  develop  matching 
chaining  upper  and  lower  bounds  for  Gaussian  processes. 

We now turn to the proof of Theorem 6.5. The key idea that we aim to
make precise is that if $N$ is an $\varepsilon$-packing, then the Gaussian vector $\{X_t\}_{t\in N}$
behaves in some sense like a collection $\{Y_t\}_{t\in N}$ of i.i.d. Gaussians, so that we
can apply Lemma 6.4. We therefore need a tool that allows us to compare
the maxima of two different Gaussian vectors. To this end, we will use the
following classical comparison inequality for Gaussian vectors.

Theorem 6.8 (Slepian-Fernique). Let $X\sim N(0,\Sigma^X)$ and $Y\sim N(0,\Sigma^Y)$
be $n$-dimensional Gaussian vectors. Suppose that we have

$$
\mathbf{E}|X_i-X_j|^2\ge\mathbf{E}|Y_i-Y_j|^2\quad\text{for all }i,j=1,\ldots,n.
$$

Then

$$
\mathbf{E}\Big[\max_{i\le n}X_i\Big]\ge\mathbf{E}\Big[\max_{i\le n}Y_i\Big].
$$

Using  this  comparison  inequality,  we  can  now  easily  complete  the  proof  of 
Sudakov’s  inequality  by  comparing  with  the  independent  case. 
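Proof (Theorem 6.5). Fix $\varepsilon>0$ and an $\varepsilon$-packing $N$ of $T$ for the time being.
Define $X=\{X_t\}_{t\in N}$, and let $Y=\{Y_t\}_{t\in N}$ be i.i.d. $N(0,\varepsilon^2/2)$ variables. Then

$$
\mathbf{E}|X_t-X_s|^2=d(t,s)^2\ge\varepsilon^2=\mathbf{E}|Y_t-Y_s|^2\quad\text{for all }t,s\in N,\ t\ne s.
$$

Therefore, we obtain using Theorem 6.8 and Lemma 6.4

$$
\mathbf{E}\Big[\sup_{t\in T}X_t\Big]\ge\mathbf{E}\Big[\max_{t\in N}X_t\Big]\ge\mathbf{E}\Big[\max_{t\in N}Y_t\Big]\ge c\,\varepsilon\sqrt{\log|N|}.
$$

We now optimize over $\varepsilon>0$ and $\varepsilon$-packings $N$ to obtain

$$
\mathbf{E}\Big[\sup_{t\in T}X_t\Big]\ge c\,\sup_{\varepsilon>0}\varepsilon\sqrt{\log D(T,d,\varepsilon)}\ge c\,\sup_{\varepsilon>0}\varepsilon\sqrt{\log N(T,d,\varepsilon)},
$$

where we have used Lemma 5.12 in the last inequality.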
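
The comparison inequality of Theorem 6.8 can be illustrated on a simple pair of Gaussian vectors. In the sketch below (an illustrative choice, not taken from the notes), $X$ has i.i.d. $N(0,1)$ coordinates and $Y$ has equicorrelated $N(0,1)$ coordinates with correlation `rho`, so that $\mathbf{E}|Y_i-Y_j|^2=2(1-\rho)\le 2=\mathbf{E}|X_i-X_j|^2$. Writing $Y_i=\sqrt{\rho}\,g+\sqrt{1-\rho}\,\xi_i$ shows that here $\mathbf{E}[\max_iY_i]=\sqrt{1-\rho}\,\mathbf{E}[\max_iX_i]$ exactly, which the Monte Carlo estimates should reproduce approximately.

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials, rho = 50, 20_000, 0.5

# X: i.i.d. N(0,1) coordinates, so E|X_i - X_j|^2 = 2.
x = rng.standard_normal((trials, n))

# Y: equicorrelated N(0,1) coordinates, so E|Y_i - Y_j|^2 = 2(1 - rho) <= 2.
common = rng.standard_normal((trials, 1))                # shared factor g
y = np.sqrt(rho) * common + np.sqrt(1 - rho) * rng.standard_normal((trials, n))

mean_max_x = x.max(axis=1).mean()       # estimate of E max_i X_i
mean_max_y = y.max(axis=1).mean()       # estimate of E max_i Y_i
print(mean_max_x, mean_max_y)
```

The vector with the larger increments has the larger expected maximum, even though both vectors have the same one-dimensional marginals.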


We  now  turn  to  the  proof  of  Theorem  6.8.  Let  us  note  that  up  to  this  point, 
we  have  not  used  any  properties  that  are  particularly  specific  to  Gaussian 
processes.  Indeed,  in  Lemma  6.4  we  used  only  a  subgaussian-type  lower  bound 
on  the  tail  probabilities,  and  the  conclusions  of  Theorems  6.5  and  6.8  can 
certainly  hold  also  for  other  types  of  processes.  In  the  proof  of  Theorem  6.8, 
however,  we  will  perform  computations  that  exploit  the  specific  form  of  the 
Gaussian  distribution.  This  is  the  only  point  in  this  chapter  we  will  use  the  full 
strength  of  the  Gaussian  assumption.  The  Gaussian  interpolation  technique 
that  will  be  used  in  the  proof  is  of  interest  in  its  own  right,  and  proves  to  be 
useful  in  many  other  interesting  problems  involving  Gaussian  variables. 

The idea behind the proof of Theorem 6.8 is as follows. We would like to
prove that the expected maximum of the vector $Y$ is smaller than that of the
vector $X$. Rather than proving this directly, we will define a family of Gaussian
vectors $\{Z(t)\}_{t\in[0,1]}$ that interpolate between $Z(0)=Y$ and $Z(1)=X$. To
establish Theorem 6.8, it then suffices to show that the expected maximum of
$Z(t)$ is increasing in $t$. The beauty of this approach is that the latter problem
can be investigated "locally" by considering the derivative with respect to $t$.

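Lemma 6.9 (Interpolation). Let $X\sim N(0,\Sigma^X)$ and $Y\sim N(0,\Sigma^Y)$ be
independent $n$-dimensional Gaussian vectors, and define

$$
Z(t)=\sqrt{t}\,X+\sqrt{1-t}\,Y,\qquad t\in[0,1].
$$

Then we have for every smooth function $f$

$$
\frac{d}{dt}\mathbf{E}[f(Z(t))]=\frac{1}{2}\sum_{i,j=1}^n(\Sigma^X_{ij}-\Sigma^Y_{ij})\,\mathbf{E}\bigg[\frac{\partial^2f}{\partial x_i\partial x_j}(Z(t))\bigg].
$$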


The result of Lemma 6.9 is very closely related to the computations that
we performed to prove the Gaussian Poincaré inequality in Example 2.22: the
second derivative appears here for precisely the same reason as it does in the
generator of the Ornstein-Uhlenbeck process. To prove Lemma 6.9, we require
a multidimensional version of the Gaussian integration by parts Lemma 2.24.
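Lemma 6.10 (Gaussian integration by parts). Let $X\sim N(0,\Sigma)$. Then

$$
\mathbf{E}[X_if(X)]=\sum_{j=1}^n\Sigma_{ij}\,\mathbf{E}\bigg[\frac{\partial f}{\partial x_j}(X)\bigg].
$$

Proof. Let $Z\sim N(0,I)$. Then $X$ has the same distribution as $\Sigma^{1/2}Z$. Thus

$$
\mathbf{E}[X_if(X)]=\sum_{k=1}^n\Sigma^{1/2}_{ik}\,\mathbf{E}[Z_kf(\Sigma^{1/2}Z)]=\sum_{k=1}^n\Sigma^{1/2}_{ik}\,\mathbf{E}[Z_kg(Z)],
$$

where $g(z)=f(\Sigma^{1/2}z)$. As $\{Z_k\}$ are independent, we can apply the
integration by parts Lemma 2.24 conditionally on $\{Z_j\}_{j\ne k}$ to obtain

$$
\mathbf{E}[Z_kg(Z)]=\mathbf{E}\bigg[\frac{\partial g}{\partial z_k}(Z)\bigg]=\sum_{j=1}^n\Sigma^{1/2}_{jk}\,\mathbf{E}\bigg[\frac{\partial f}{\partial x_j}(\Sigma^{1/2}Z)\bigg].
$$

The proof is easily completed as $\sum_k\Sigma^{1/2}_{ik}\Sigma^{1/2}_{jk}=\Sigma_{ij}$. □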
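
The identity of Lemma 6.10 can be checked by Monte Carlo for a concrete test function. The sketch below assumes the illustrative choice $f(x)=\sin\langle a,x\rangle$, for which $\partial f/\partial x_j=a_j\cos\langle a,x\rangle$ and $\mathbf{E}[\cos\langle a,X\rangle]=e^{-\langle a,\Sigma a\rangle/2}$ is known in closed form; the covariance matrix `cov` and the vector `a` are arbitrary assumed values.

```python
import numpy as np

rng = np.random.default_rng(5)
cov = np.array([[2.0, 0.6, 0.0],
                [0.6, 1.0, -0.3],
                [0.0, -0.3, 0.5]])               # illustrative covariance Sigma
a = np.array([0.5, -1.0, 0.8])                   # f(x) = sin(<a, x>)

x = rng.multivariate_normal(np.zeros(3), cov, size=400_000)
f = np.sin(x @ a)
grad_f = np.cos(x @ a)[:, None] * a[None, :]     # gradient of f at each sample

lhs = (x * f[:, None]).mean(axis=0)              # E[X_i f(X)], i = 1, 2, 3
rhs = cov @ grad_f.mean(axis=0)                  # sum_j Sigma_ij E[d_j f(X)]
exact = cov @ a * np.exp(-0.5 * a @ cov @ a)     # closed form of the right side
print(lhs)
print(rhs)
print(exact)
```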


Using  the  Gaussian  integration  by  parts  property,  the  proof  of  the  inter¬ 
polation  Lemma  6.9  is  now  a  matter  of  straightforward  computation. 

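Proof (Lemma 6.9). We readily compute

$$
\frac{d}{dt}\mathbf{E}[f(Z(t))]=\sum_{i=1}^n\mathbf{E}\bigg[\frac{\partial f}{\partial x_i}(Z(t))\,\frac{dZ_i(t)}{dt}\bigg]=\frac{1}{2}\sum_{i=1}^n\mathbf{E}\bigg[\frac{\partial f}{\partial x_i}(Z(t))\bigg\{\frac{X_i}{\sqrt{t}}-\frac{Y_i}{\sqrt{1-t}}\bigg\}\bigg].
$$

As $X$ and $Y$ are independent, we can apply Lemma 6.10 to the $2n$-dimensional
Gaussian vector $(X,Y)$ to compute the first term on the right as

$$
\mathbf{E}\bigg[\frac{\partial f}{\partial x_i}(Z(t))\,\frac{X_i}{\sqrt{t}}\bigg]=\sum_{j=1}^n\Sigma^X_{ij}\,\mathbf{E}\bigg[\frac{\partial^2f}{\partial x_i\partial x_j}(Z(t))\bigg].
$$

An identical computation for the second term completes the proof.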
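
The two expressions equated in the proof of Lemma 6.9 can be compared numerically. The sketch below assumes particular illustrative choices: $X$ with i.i.d. coordinates, $Y$ equicorrelated, and the smooth test function $f(z)=\beta^{-1}\log\sum_ie^{\beta z_i}$, whose gradient is the softmax vector $p(z)$ and whose Hessian is $\beta\{\mathrm{diag}(p)-pp^{\top}\}$. It estimates the chain-rule form of the derivative and the second-derivative form by Monte Carlo at a fixed time $t$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials, t, beta, rho = 5, 400_000, 0.5, 1.0, 0.5

cov_x = np.eye(n)                                        # X: i.i.d. coordinates
cov_y = (1 - rho) * np.eye(n) + rho * np.ones((n, n))    # Y: equicorrelated
x = rng.multivariate_normal(np.zeros(n), cov_x, size=trials)
y = rng.multivariate_normal(np.zeros(n), cov_y, size=trials)
z = np.sqrt(t) * x + np.sqrt(1 - t) * y                  # samples of Z(t)

# Softmax weights p(z) = gradient of f, computed stably.
w = np.exp(beta * (z - z.max(axis=1, keepdims=True)))
p = w / w.sum(axis=1, keepdims=True)

# Chain rule: dZ/dt = X / (2 sqrt(t)) - Y / (2 sqrt(1 - t)).
dz_dt = 0.5 * (x / np.sqrt(t) - y / np.sqrt(1 - t))
lhs = (p * dz_dt).sum(axis=1).mean()

# Lemma 6.9: (1/2) sum_ij (Sigma^X_ij - Sigma^Y_ij) E[d^2 f / dx_i dx_j (Z(t))].
mean_hess = beta * (np.diag(p.mean(axis=0)) - (p.T @ p) / trials)
rhs = 0.5 * ((cov_x - cov_y) * mean_hess).sum()
print(lhs, rhs)
```

Both estimates are positive in this configuration, in line with the fact that the increments of $X$ dominate those of $Y$.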


We are now ready to complete the proof of Theorem 6.8. Ideally, we would
like the proof to work as follows. First, we define $f(x)=\max_{i\le n}x_i$. We then
use Lemma 6.9 to establish that under the assumptions of Theorem 6.8

$$
\frac{d}{dt}\mathbf{E}[f(Z(t))]\ge 0.
$$

Then the proof is complete, as this evidently implies

$$
\mathbf{E}\Big[\max_{i\le n}X_i\Big]=\mathbf{E}[f(Z(1))]\ge\mathbf{E}[f(Z(0))]=\mathbf{E}\Big[\max_{i\le n}Y_i\Big].
$$

The problem with this idea is that the function $f$ is not twice differentiable, so
that we cannot apply Lemma 6.9 directly. We can nonetheless make the proof
work by working with a convenient smooth approximation of the function $f$.

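Proof (Theorem 6.8). Define for $\beta>0$ the function

$$
f_\beta(x)=\frac{1}{\beta}\log\sum_{i=1}^ne^{\beta x_i}.
$$

Then evidently (cf. Problem 5.2)

$$
\max_{i\le n}x_i=\frac{1}{\beta}\log\Big(\max_{i\le n}e^{\beta x_i}\Big)\le f_\beta(x)\le\frac{1}{\beta}\log\Big(n\max_{i\le n}e^{\beta x_i}\Big)=\max_{i\le n}x_i+\frac{\log n}{\beta}.
$$

Thus $f_\beta(x)\to\max_{i\le n}x_i$ as $\beta\to\infty$. Moreover,

$$
\frac{\partial f_\beta(x)}{\partial x_i}=\frac{e^{\beta x_i}}{\sum_{j=1}^ne^{\beta x_j}}=:p_i(x),\qquad\frac{\partial^2f_\beta(x)}{\partial x_i\partial x_j}=\beta\{\delta_{ij}p_i(x)-p_i(x)p_j(x)\}.
$$

Lemma 6.9 therefore yields

$$
\frac{d}{dt}\mathbf{E}[f_\beta(Z(t))]=\frac{\beta}{2}\sum_{i}(\Sigma^X_{ii}-\Sigma^Y_{ii})\,\mathbf{E}[p_i(Z(t))\{1-p_i(Z(t))\}]-\frac{\beta}{2}\sum_{i\ne j}(\Sigma^X_{ij}-\Sigma^Y_{ij})\,\mathbf{E}[p_i(Z(t))p_j(Z(t))].
$$

But noting that $1-p_i(x)=\sum_{j\ne i}p_j(x)$, we can write

$$
\sum_{i}a_ip_i(x)\{1-p_i(x)\}=\sum_{i\ne j}a_ip_i(x)p_j(x)=\sum_{i\ne j}a_jp_i(x)p_j(x),
$$

where we exchanged the roles of the variables $i$ and $j$. Averaging the two
expressions on the right hand side and plugging into the above identity yields

$$
\frac{d}{dt}\mathbf{E}[f_\beta(Z(t))]=\frac{\beta}{4}\sum_{i\ne j}\{\mathbf{E}|X_i-X_j|^2-\mathbf{E}|Y_i-Y_j|^2\}\,\mathbf{E}[p_i(Z(t))p_j(Z(t))]
$$

using $\mathbf{E}|X_i-X_j|^2=\Sigma^X_{ii}-2\Sigma^X_{ij}+\Sigma^X_{jj}$ and $\mathbf{E}|Y_i-Y_j|^2=\Sigma^Y_{ii}-2\Sigma^Y_{ij}+\Sigma^Y_{jj}$.
It follows immediately from our assumptions that the right hand side of this
expression is nonnegative, so that $\mathbf{E}[f_\beta(Z(t))]$ is increasing in $t$. Thus

$$
\mathbf{E}[f_\beta(X)]=\mathbf{E}[f_\beta(Z(1))]\ge\mathbf{E}[f_\beta(Z(0))]=\mathbf{E}[f_\beta(Y)].
$$

Letting $\beta\to\infty$ in this expression completes the proof.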


□ 
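The smooth approximation used in this proof is easy to sanity-check numerically. The following is a minimal sketch, not part of the original notes: the function names `soft_max` and `weights` and the test vector `x` are illustrative choices. It checks the sandwich bound $\max_i x_i \le f_\beta(x) \le \max_i x_i + \beta^{-1}\log n$ and prints the weights $p_i(x)$, which form a probability vector.

```python
import math

def soft_max(x, beta):
    """f_beta(x) = (1/beta) * log(sum_i exp(beta * x_i)), computed stably
    by factoring out the largest coordinate."""
    m = max(x)
    return m + math.log(sum(math.exp(beta * (xi - m)) for xi in x)) / beta

def weights(x, beta):
    """p_i(x) = exp(beta * x_i) / sum_j exp(beta * x_j), the gradient of f_beta."""
    m = max(x)
    e = [math.exp(beta * (xi - m)) for xi in x]
    s = sum(e)
    return [ei / s for ei in e]

x = [0.3, -1.2, 0.9, 0.85]  # arbitrary illustrative vector
for beta in (1.0, 10.0, 100.0):
    gap = soft_max(x, beta) - max(x)
    # sandwich bound: 0 <= f_beta(x) - max_i x_i <= log(n) / beta
    print(beta, gap, math.log(len(x)) / beta)

print(weights(x, 3.0), sum(weights(x, 3.0)))
```

The gap shrinks at rate $1/\beta$, which is why letting $\beta \to \infty$ at the end of the proof recovers the maximum.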




The  conclusion  of  the  proof  of  Theorem  6.8  marks  the  last  time  in  this 
chapter that we will make explicit use of the Gaussian property of the underlying
process. In the rest of this chapter, we will only make use of two facts
about  Gaussian  processes:  the  validity  of  Sudakov’s  inequality  (Theorem  6.5), 
and  Gaussian  concentration  (Theorem  3.25).  While  both  these  properties  are 
stronger  than  the  subgaussian  property  used  in  the  previous  chapter,  such 
properties  or  their  variants  do  continue  to  hold  in  many  situations  where  the 
underlying  process  is  not  actually  Gaussian.  For  this  reason,  while  we  will 
concentrate  our  attention  here  on  the  classical  setting  of  Gaussian  processes 
for  concreteness,  the  methods  that  we  are  about  to  develop  prove  to  be  very 
useful  in  a  variety  of  problems  that  go  far  beyond  the  Gaussian  setting. 

Problems 

Problem 6.11 (Norm of a random matrix). Let $M$ be an $n \times m$ random
matrix such that $M_{ij}$ are independent $N(0,1)$ random variables. In Example
5.10, we used an $\varepsilon$-net argument to show that $\mathbf{E}\|M\| \le C\sqrt{n+m}$ for some
universal constant $C$ (this conclusion holds even in the case where the entries
$M_{ij}$ are only subgaussian). The goal of this problem is to obtain some further
insight on the norm of a random matrix in the Gaussian case.

a. The $\varepsilon$-net argument only yields an upper bound $\mathbf{E}\|M\| \le C\sqrt{n+m}$. It
is far from clear, a priori, whether this bound is sharp. Use Sudakov’s
inequality to show that in the Gaussian case, we have in fact a matching
lower bound $\mathbf{E}\|M\| \ge C'\sqrt{n+m}$ for some universal constant $C'$.

Hint: consider the Gaussian process $X_{v,w} = \langle v, Mw\rangle$ on $S^{n-1} \times S^{m-1}$
(where $S^{n-1}$ is the unit sphere in $\mathbb{R}^n$), and show that the corresponding
natural distance satisfies $d((v,w),(v',w')) \ge \|v - v'\| \vee \|w - w'\|$.

While upper bounds using $\varepsilon$-net arguments or chaining often give sharp results
up to universal constants, there is little hope to obtain realistic values of the
constants in this manner. If one cares about the best values of the constants,
one must typically resort to other techniques. In the Gaussian setting of this
problem, we can use the Slepian-Fernique inequality as a replacement for the
$\varepsilon$-net argument to prove the much sharper inequality $\mathbf{E}\|M\| \le \sqrt{n} + \sqrt{m}$. In fact,
it is known from random matrix theory that this result is sharp asymptotically
as $n \to \infty$ with $m \propto n$ (note that this improved estimate does not contradict
our earlier bounds as $2^{-1/2}\{\sqrt{n} + \sqrt{m}\} \le \sqrt{n+m} \le \sqrt{n} + \sqrt{m}$).

b. Let $Z \sim N(0, I_n)$ and $Z' \sim N(0, I_m)$ be independent standard Gaussian
vectors of dimensions $n$ and $m$, and define for $(v,w) \in S^{n-1} \times S^{m-1}$

$$
X_{v,w} = \langle v, Mw\rangle, \qquad Y_{v,w} = \langle v, Z\rangle + \langle w, Z'\rangle.
$$

Show that $\mathbf{E}|Y_{v,w} - Y_{v',w'}|^2 \ge \mathbf{E}|X_{v,w} - X_{v',w'}|^2$ for all $v, v', w, w'$.

c. Conclude by the Slepian-Fernique inequality that $\mathbf{E}\|M\| \le \sqrt{n} + \sqrt{m}$.
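Before attempting part b, it can be reassuring to test the claimed comparison of increments on random inputs. The sketch below is only a numerical sanity check under stated assumptions, not a proof: the helper names are hypothetical, and the two increment formulas are obtained by expanding the variances directly, using that the entries of $M$, $Z$, $Z'$ are independent $N(0,1)$.

```python
import math
import random

def unit_vector(dim, rng):
    """Uniformly random point on the unit sphere in R^dim."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def incr_X(v, w, v2, w2):
    # X_{v,w} - X_{v2,w2} = sum_{ij} M_ij (v_i w_j - v2_i w2_j), so its
    # variance is the sum of the squared coefficients.
    return sum((vi * wj - v2i * w2j) ** 2
               for vi, v2i in zip(v, v2)
               for wj, w2j in zip(w, w2))

def incr_Y(v, w, v2, w2):
    # Y_{v,w} - Y_{v2,w2} = <v - v2, Z> + <w - w2, Z'> with Z, Z' independent.
    return (sum((a - b) ** 2 for a, b in zip(v, v2))
            + sum((a - b) ** 2 for a, b in zip(w, w2)))

rng = random.Random(0)
n, m = 4, 3
worst = float("inf")
for _ in range(1000):
    v, v2 = unit_vector(n, rng), unit_vector(n, rng)
    w, w2 = unit_vector(m, rng), unit_vector(m, rng)
    worst = min(worst, incr_Y(v, w, v2, w2) - incr_X(v, w, v2, w2))

# Smallest observed value of E|Y - Y'|^2 - E|X - X'|^2 (should not be negative).
print(worst)
```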
Problem 6.12 (Gordon’s inequality and the smallest singular value).

The  Slepian-Fernique  inequality  is  only  one  of  a  family  of  Gaussian  comparison 
inequalities.  There  is  nothing  terribly  special  about  the  maximum  function — 
the  only  important  property  needed  to  apply  the  interpolation  Lemma  6.9  is 
that  the  second  derivatives  of  the  function  have  the  appropriate  sign. 

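In this problem, we will develop another Gaussian comparison inequality
due to Gordon. To this end, let $X$ and $Y$ be $n \times m$ matrices with centered
and jointly Gaussian (but not necessarily independent) entries. To obtain a
comparison, we will assume the following inequalities between the covariances:

$$
\begin{aligned}
\mathbf{E}[X_{ij}X_{il}] &\le \mathbf{E}[Y_{ij}Y_{il}] && \text{for all } i, j, l,\\
\mathbf{E}[X_{ij}X_{kl}] &\ge \mathbf{E}[Y_{ij}Y_{kl}] && \text{for all } i \ne k \text{ and } j, l,\\
\mathbf{E}[X_{ij}^2] &= \mathbf{E}[Y_{ij}^2] && \text{for all } i, j.
\end{aligned}
$$

a. Show that for all $x \in \mathbb{R}$

$$
\mathbf{P}\Big[\min_{i\le n}\max_{j\le m} X_{ij} \ge x\Big] \ge \mathbf{P}\Big[\min_{i\le n}\max_{j\le m} Y_{ij} \ge x\Big].
$$

Hint: let $a_k : \mathbb{R} \to [0,1]$ be smooth and decreasing in $x$ such that $a_k(x) \to \mathbf{1}_{x<0}$
as $k \to \infty$. Apply Lemma 6.9 to $f_k(x) = \prod_{i=1}^n \{1 - \prod_{j=1}^m a_k(x_{ij} - x)\}$.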

b. Conclude that

$$
\mathbf{E}\Big[\min_{i\le n}\max_{j\le m} X_{ij}\Big] \ge \mathbf{E}\Big[\min_{i\le n}\max_{j\le m} Y_{ij}\Big].
$$

Let $M$ be an $n \times m$ random matrix with $n \ge m$, such that $M_{ij}$ are independent
$N(0,1)$ random variables. The minimal and maximal singular values of $M$ are
defined as the optimal constants $s_{\min}(M)$, $s_{\max}(M)$ in the inequality

$$
s_{\min}(M)\,\|x\| \le \|Mx\| \le s_{\max}(M)\,\|x\| \quad \text{for all } x \in \mathbb{R}^m.
$$

Evidently $s_{\max}(M) = \|M\|$, and thus we obtained a sharp upper bound for
$s_{\max}(M)$ using Slepian’s inequality in the previous problem. Using Gordon’s
inequality, we can obtain a sharp lower bound for $s_{\min}(M)$.

c. Use Gordon’s inequality to show that $\mathbf{E}[s_{\min}(M)] \ge \sqrt{n} - \sqrt{m}$.

Hint: If $Z_n \sim N(0, I_n)$ is $n$-dimensional standard normal, it can be verified
by tedious explicit computation that $\mathbf{E}\|Z_n\| - \sqrt{n}$ is increasing in $n$.

Problem  6.13  (Sudakov’s  inequality  and  convex  geometry).  The  proof 
of  Sudakov’s  inequality  that  we  have  given  is  certainly  the  most  intuitive. 
However,  it  relies  on  the  Slepian-Fernique  inequality,  whose  proof  is  based 
on  explicit  Gaussian  computations.  The  goal  of  this  problem  is  to  give  a 
completely  different  proof  of  Sudakov’s  inequality  using  ideas  from  convex 
geometry.  The  fact  that  Sudakov’s  inequality  can  be  proved  by  such  drastically 
different means suggests that this result is more robust and less closely tied
to  the  precise  form  of  the  Gaussian  distribution  than  might  appear  from  the 
proof  using  Slepian-Fernique.  In  any  case,  the  connection  between  Sudakov’s 
inequality  and  convex  geometry  is  of  significant  interest  in  its  own  right. 

We begin by reducing the problem to a convenient special case. Let $G = \{g_1, \ldots, g_n\}$
be independent $N(0,1)$ variables, and define

$$
X_t = \sum_{k=1}^n g_k t_k, \qquad t \in \mathbb{R}^n.
$$

Let $T \subseteq \mathbb{R}^n$, and consider the Gaussian process $\{X_t\}_{t\in T}$. The natural distance
for this process is simply the Euclidean distance $d(x,y) = \|x - y\|$.

a. Argue that to prove Theorem 6.5 in full generality, it suffices to consider
the special Gaussian processes $\{X_t\}_{t\in T}$ as defined above.

Hint: for any Gaussian process $\{Z_u\}_{u\in U}$ and points $u_1, \ldots, u_n \in U$, find
points $t_1, \ldots, t_n \in \mathbb{R}^n$ such that $\{Z_{u_i}\}_{i\le n}$ has the same law as $\{X_{t_i}\}_{i\le n}$.

b. Argue further that it suffices to consider only convex sets $T \subseteq \mathbb{R}^n$.

c. Show that for any $t_0 \in T$

$$
\mathbf{E}\Big[\sup_{t\in T} |X_t - X_{t_0}|\Big] \le 2\,\mathbf{E}\Big[\sup_{t\in T} X_t\Big].
$$

Conclude that it suffices to consider only symmetric convex sets $T \subseteq \mathbb{R}^n$.

We now take a rather surprising detour by proving an apparently quite different
result. Given two convex sets $A$ and $B$ in $\mathbb{R}^n$, let $N(B, A)$ be the smallest
number of translates of $A$ needed to cover $B$: that is,

$$
N(B, A) := \min\Big\{k : \exists\, x_1, \ldots, x_k \in \mathbb{R}^n \text{ such that } B \subseteq \bigcup_{i=1}^k \{x_i + A\}\Big\}.
$$

We are going to prove the following inequality:

$$
\mathbf{P}[G \in A] \ge \frac{2}{3} \quad\text{implies}\quad \sup_{\varepsilon>0} \varepsilon\sqrt{\log N(B_2, \varepsilon A)} \le c
$$

for some universal constant $c$, where $B_2 = \{x \in \mathbb{R}^n : \|x\| \le 1\}$ is the Euclidean
unit ball and $A$ is any symmetric convex set. The proof of this result
is one that we are quite familiar with: we will essentially use the same volume
argument as was used in the proof of Lemma 5.13, but we will use the
Gaussian measure $\mathbf{P}[G \in A]$ to measure the “volume” of the set $A$ instead
of the Lebesgue measure. The main difficulty is that the Gaussian measure,
unlike the Lebesgue measure, is not translation-invariant, so we must first
understand how to estimate the Gaussian measure of a translate of a set.
d. Let $A$ be a symmetric set. Show that

$$
\mathbf{P}[G \in x + A] \ge e^{-\|x\|^2/2}\,\mathbf{P}[G \in A] \quad \text{for all } x \in \mathbb{R}^n.
$$

Hint: write out the probability as a Gaussian integral and use Jensen.

e. Let $A$ be a symmetric set. Let $x_1, \ldots, x_k \in B_2$ be such that the translates
$\{x_i + \varepsilon A\}$ are disjoint. Show that we can estimate

$$
k\,e^{-1/2\varepsilon^2}\,\mathbf{P}[G \in A] \le \sum_{i=1}^k \mathbf{P}\Big[G \in \frac{x_i}{\varepsilon} + A\Big] \le 1.
$$

f. Let $A$ be a symmetric convex set. Show that

$$
N(B, 2A) \le \max\{k : \exists\, x_1, \ldots, x_k \in B \text{ s.t. } \{x_i + A\} \text{ are disjoint}\}.
$$

Hint: if $\{x + A\} \cap \{z + A\} \ne \varnothing$, then $z \in x + A - A$, and thus $z \in x + 2A$
as $A$ is symmetric and convex (note that $A + A \ne 2A$ without convexity!)

g. Conclude that if $A$ is a symmetric convex set and $\mathbf{P}[G \in A] \ge 2/3$, then

$$
\sup_{\varepsilon>0} \varepsilon\sqrt{\log N(B_2, \varepsilon A)} \le c
$$

for a universal constant $c$.
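The translation estimate of part d can be checked by hand in one dimension, where a symmetric set is an interval $A = [-a, a]$ and the Gaussian measure is available through the error function. The following is a minimal sketch under that one-dimensional assumption (the function names and the grid of test values are illustrative, not from the notes); it is a sanity check of the inequality, not a proof.

```python
import math

def std_normal_cdf(u):
    """Distribution function of a standard normal variable."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def gauss_measure_interval(center, half_width):
    """P[G in center + [-half_width, half_width]] for G ~ N(0,1)."""
    return std_normal_cdf(center + half_width) - std_normal_cdf(center - half_width)

def both_sides(x, a):
    """Return (P[G in x + A], exp(-x^2/2) * P[G in A]) for A = [-a, a]."""
    lhs = gauss_measure_interval(x, a)
    rhs = math.exp(-x * x / 2.0) * gauss_measure_interval(0.0, a)
    return lhs, rhs

for x in (0.0, 0.5, 1.0, 2.0):
    for a in (0.1, 1.0, 3.0):
        lhs, rhs = both_sides(x, a)
        print(x, a, lhs, rhs, lhs >= rhs - 1e-12)
```

For small $a$ the two sides are nearly equal, which suggests that the factor $e^{-\|x\|^2/2}$ cannot be improved in general.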

So far, the supremum of the Gaussian process does not appear. Let us correct
this. Let $T$ be a symmetric convex set, and define its polar

$$
T^\circ := \{x \in \mathbb{R}^n : \langle t, x\rangle \le 1 \text{ for all } t \in T\}.
$$

Then evidently

$$
\mathbf{P}[G \in aT^\circ] = \mathbf{P}\Big[\sup_{t\in T} X_t \le a\Big] \ge 1 - \frac{1}{a}\,\mathbf{E}\Big[\sup_{t\in T} X_t\Big]
$$

by Markov’s inequality. So if we choose $A = 3\,\mathbf{E}[\sup_{t\in T} X_t]\,T^\circ$, we obtain

$$
\sup_{\varepsilon>0} \varepsilon\sqrt{\log N(B_2, \varepsilon T^\circ)} \le 3c\,\mathbf{E}\Big[\sup_{t\in T} X_t\Big].
$$

This result is known as the dual Sudakov inequality. The covering number
on the left-hand side is not the same one that shows up in the Sudakov
inequality: in Theorem 6.5, $N(B_2, \varepsilon T^\circ)$ is replaced by $N(T, d, \varepsilon) = N(T, \varepsilon B_2)$.
To deduce the Sudakov inequality from the dual Sudakov inequality, we will
use a convex duality argument to relate these two covering numbers.

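h. Show that for every $x \in \mathbb{R}^n$

$$
\|x\|^2 = \langle x, x\rangle \le \sup_{t\in T}\langle t, x\rangle\,\sup_{t\in T^\circ}\langle t, x\rangle.
$$

Hint: note that $x / \sup_{t\in T}\langle t, x\rangle \in T^\circ$.

i. Conclude from the previous part that $2T \cap \frac{\varepsilon^2}{2}T^\circ \subseteq \varepsilon B_2$, and therefore

$$
N(T, \varepsilon B_2) \le N\big(T,\, 2T \cap \tfrac{\varepsilon^2}{2}T^\circ\big) = N\big(T,\, \tfrac{\varepsilon^2}{2}T^\circ\big).
$$

j. Show that

$$
N(T, \varepsilon B_2) \le N(T, 2\varepsilon B_2)\,N\big(2\varepsilon B_2,\, \tfrac{\varepsilon^2}{2}T^\circ\big).
$$

Hint: construct a cover of $T$ by translates of $\frac{\varepsilon^2}{2}T^\circ$ by first covering $T$ by
translates of $2\varepsilon B_2$, then covering each of the latter by translates of $\frac{\varepsilon^2}{2}T^\circ$.

k. Conclude that

$$
\sup_{\varepsilon>0} \varepsilon\sqrt{\log N(T, \varepsilon B_2)} \le 8\,\sup_{\varepsilon>0} \varepsilon\sqrt{\log N(B_2, \varepsilon T^\circ)},
$$

so that Theorem 6.5 follows from the dual Sudakov inequality.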
In the previous section we made a first step towards proving lower bounds for
the suprema of Gaussian processes: we showed how one can make precise the
intuition that well-separated points behave like independent variables. This
allows us to obtain a lower bound in terms of the covering number at a single
scale. However, in the upper bound we obtained by chaining, we necessarily
must deal with infinitely many scales in order to eliminate the remainder
term in the chaining method. In order to close the gap between our upper
and lower bounds, our second challenge is therefore to show how to obtain a
multiscale lower bound. We will presently show how this can be done.

Let us recall the basic step in the chaining method: if $\mathrm{diam}(T) \le \varepsilon$ and if
$N \subseteq T$ is an $\varepsilon/2$-net, then we have for some universal constant $c_1$

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le c_1\,\varepsilon\sqrt{\log|N|} + \mathbf{E}\Big[\sup_{t\in T}\{X_t - X_{\pi(t)}\}\Big],
$$

where $\pi(t)$ denotes a point of $N$ closest to $t$.
This yields the contribution at a single scale $\varepsilon$, plus a remainder term. By
iterating this bound, we can eliminate the remainder term and obtain a sum
at infinitely many scales. To obtain a matching lower bound, we would like to
mimic this procedure in the reverse direction. In order to do this, we would
like to have an inequality of the following form: if $N \subseteq T$ is an $\varepsilon$-packing, then

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge c_2\,\varepsilon\sqrt{\log|N|} + \text{a remainder term}
$$

for some universal constant $c_2$. In the absence of the remainder term, this
is precisely Sudakov’s inequality proved in the previous section. However,
without the remainder term, our lower bound necessarily terminates at a
single scale. On the other hand, if we could prove an improvement of Sudakov’s
inequality that includes a remainder term (hopefully of a similar form to the
one that appears in the chaining upper bound), then it becomes possible to
iterate this inequality to obtain a multiscale lower bound. In essence, our aim
is to develop an improved version of Sudakov’s inequality that will allow us to
run the chaining argument in reverse! This is the idea of the following result.

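Theorem 6.14 (Super-Sudakov). Let $\{X_t\}_{t\in T}$ be a separable Gaussian
process and let $N$ be an $\varepsilon$-packing of $(T, d)$. Then we can estimate

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge c\,\varepsilon\sqrt{\log|N|} + \min_{s\in N}\mathbf{E}\Big[\sup_{t\in B(s,\alpha\varepsilon)} X_t\Big],
$$

where $c$ and $\alpha < \frac{1}{2}$ are universal constants and $B(s, \varepsilon) := \{t \in T : d(t,s) \le \varepsilon\}$.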
The  geometry  of  Theorem  6.14  is  illustrated  in  the  following  figure: 


The  set  T  (large  circle)  is  packed  with  points  at  distance  e;  around  each  point 
in  the  packing,  we  consider  the  set  of  parameters  in  a  ball  with  radius  ae 
(small  circles).  The  supremum  of  the  process  over  the  entire  set  is  estimated 
from  below  by  the  lower  bound  obtained  by  applying  Sudakov’s  inequality 
to  the  e-packing,  plus  a  remainder  term  which  corresponds  to  the  smallest 
expected  supremum  of  the  process  over  one  of  the  disjoint  balls. 

The  proof  of  Theorem  6.14  is  not  difficult.  It  will  be  deduced  directly  from 
Sudakov’s  inequality,  together  with  the  following  basic  consequence  of  the 
Gaussian  concentration  principle  (Theorem  3.25). 

Lemma 6.15 (Concentration of suprema). Let $\{X_t\}_{t\in T}$ be a separable
Gaussian process. Then $\sup_{t\in T} X_t$ is $\sup_{t\in T}\mathrm{Var}[X_t]$-subgaussian.

Proof. By separability, we can approximate the supremum over $T$ by the
supremum over a finite set (cf. the proof of Theorem 5.24). It therefore suffices
to prove the result for the maximum $\max_{i\le n} X_i$ of an $n$-dimensional Gaussian
vector $X \sim N(0, \Sigma)$. It is convenient to write $X = \Sigma^{1/2}Z$ for $Z \sim N(0, I)$.
It then follows from Theorem 3.25 that $\max_{i\le n} X_i$ is $\|\|\nabla f\|^2\|_\infty$-subgaussian,
where we have defined the function $f(z) := \max_{i\le n}(\Sigma^{1/2}z)_i$. Note that

$$
\frac{\partial f}{\partial z_j}(z) = \Sigma^{1/2}_{i^*(z)j},
$$

where we defined $i^*(z) := \arg\max_{i\le n}(\Sigma^{1/2}z)_i$. Thus

$$
\|\nabla f(z)\|^2 = \sum_{j=1}^n \big(\Sigma^{1/2}_{i^*(z)j}\big)^2 = \Sigma_{i^*(z)i^*(z)} \le \max_{i\le n}\Sigma_{ii}.
$$

As $\Sigma_{ii} = \mathrm{Var}[X_i]$, the result follows immediately. □

We now complete the proof of Theorem 6.14.

Proof (Theorem 6.14). We can evidently estimate


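$$
\begin{aligned}
\mathbf{E}\Big[\sup_{t\in T} X_t\Big]
&\ge \mathbf{E}\Big[\max_{s\in N}\sup_{t\in B(s,\alpha\varepsilon)} X_t\Big] \\
&= \mathbf{E}\Big[\max_{s\in N}\Big\{X_s + \mathbf{E}\Big[\sup_{t\in B(s,\alpha\varepsilon)} X_t\Big] + Y_s\Big\}\Big] \\
&\ge \mathbf{E}\Big[\max_{s\in N} X_s\Big] + \min_{s\in N}\mathbf{E}\Big[\sup_{t\in B(s,\alpha\varepsilon)} X_t\Big] - \mathbf{E}\Big[\max_{s\in N}\{-Y_s\}\Big],
\end{aligned}
$$

where we defined

$$
Y_s = \sup_{t\in B(s,\alpha\varepsilon)}\{X_t - X_s\} - \mathbf{E}\Big[\sup_{t\in B(s,\alpha\varepsilon)}\{X_t - X_s\}\Big].
$$

By Lemma 6.15, $Y_s$ is $\alpha^2\varepsilon^2$-subgaussian for all $s \in N$. Thus we obtain, bounding
the first term using Theorem 6.5 and the last term using Lemma 5.1,

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge \{c - \alpha\sqrt{2}\}\,\varepsilon\sqrt{\log|N|} + \min_{s\in N}\mathbf{E}\Big[\sup_{t\in B(s,\alpha\varepsilon)} X_t\Big]
$$

for some universal constant $c$. Choosing $\alpha = c/2\sqrt{2}$ completes the proof. □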

Let  us  compare  the  lower  bound  of  Theorem  6.14  to  the  chaining  upper 
bound.  An  immediate  difference  between  the  two  bounds  is  that  the  former  is 
stated  in  terms  of  an  e-packing,  while  the  latter  is  in  terms  of  an  e-net.  This 
will  be  taken  care  of  using  the  duality  between  covering  and  packing,  however, 
so  that  this  difference  is  not  a  major  concern  at  this  stage.  A  more  pressing 
concern  is  the  minimum  in  the  bound  of  Theorem  6.14.  To  emphasize  this 
issue,  let  us  reformulate  the  chaining  upper  bound  to  bring  out  the  similarity 
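between the two bounds: if $\mathrm{diam}(T) \le \varepsilon$ and $N \subseteq T$ is an $\alpha\varepsilon$-net, then

$$
\begin{aligned}
\mathbf{E}\Big[\sup_{t\in T} X_t\Big]
&\le c_1\,\varepsilon\sqrt{\log|N|} + \mathbf{E}\Big[\max_{s\in N}\sup_{t\in B(s,\alpha\varepsilon)}\{X_t - X_s\}\Big] \\
&\le c_2\,\varepsilon\sqrt{\log|N|} + \max_{s\in N}\mathbf{E}\Big[\sup_{t\in B(s,\alpha\varepsilon)} X_t\Big].
\end{aligned}
$$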
The first inequality follows trivially from the chaining upper bound as stated
at the beginning of this section, while the second bound is readily obtained
by using Gaussian concentration as in the proof of Theorem 6.14. In contrast,
the bound of Theorem 6.14 states that if $N$ is an $\varepsilon$-packing, then

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge c\,\varepsilon\sqrt{\log|N|} + \min_{s\in N}\mathbf{E}\Big[\sup_{t\in B(s,\alpha\varepsilon)} X_t\Big].
$$

When  phrased  in  this  manner,  the  two  bounds  appear  to  be  very  similar,  with 
one  crucial  difference:  in  the  chaining  upper  bound,  the  remainder  term  is  the 
largest  expected  supremum  of  the  Gaussian  process  over  a  ball  centered  at  one 
of  the  points  in  N,  while  the  remainder  term  in  Theorem  6.14  is  the  smallest 
expected  supremum  over  such  a  ball.  There  is  no  reason  why  the  supremum 
of  the  Gaussian  process  over  two  balls  of  the  same  radius  should  be  of  the 
same  order:  in  general,  the  remainder  terms  in  our  upper  and  lower  bounds 
can  be  of  a  very  different  order  of  magnitude.  The  major  remaining  question, 
to  be  addressed  in  the  next  section,  is  how  to  overcome  this  problem. 

For  the  time  being,  however,  we  would  like  to  illustrate  the  idea  of  chaining 
in  reverse  without  having  to  cope  with  the  complications  arising  from  the 
above  problem.  To  this  end,  we  will  investigate  in  the  remainder  of  this  section 
a  special  class  of  Gaussian  processes  for  which  this  problem  does  not  arise. 

Definition 6.16 (Stationary Gaussian process). The Gaussian process
$\{X_t\}_{t\in T}$ is called stationary if there exists a group $G$ acting on $T$ such that

1. $d(g(t), g(s)) = d(t, s)$ for all $t, s \in T$, $g \in G$ (translation invariance).

2. For every $t, s \in T$, there exists $g \in G$ such that $t = g(s)$ (transitivity).

Of course, the key point of this definition is that for a stationary Gaussian
process all balls are created equal: indeed, we have equality in distribution

$$
\{X_t - X_s : t \in B(s, \varepsilon)\} \overset{d}{=} \{X_t - X_{s'} : t \in B(s', \varepsilon)\} \quad \text{for all } s, s' \in T.
$$

To see this, recall that the law of the increments of a Gaussian process is
entirely determined by the natural metric $d$, and note that if $g \in G$ is such
that $s' = g(s)$, then $g$ maps $B(s, \varepsilon)$ isometrically onto $B(s', \varepsilon)$. Thus

$$
\max_{s\in T}\mathbf{E}\Big[\sup_{t\in B(s,\varepsilon)} X_t\Big] = \min_{s\in T}\mathbf{E}\Big[\sup_{t\in B(s,\varepsilon)} X_t\Big],
$$

so our upper and lower bounds are of the same order in this case.

so  our  upper  and  lower  bounds  are  of  the  same  order  in  this  case. 

Example 6.17 (Brownian motion). Let $\{B_t\}_{t\in\mathbb{R}}$ be two-sided Brownian motion
(that is, $B_t = B'_t$ for $t \ge 0$ and $B_t = B''_{-t}$ for $t < 0$, where $\{B'_t\}_{t\ge 0}$ and
$\{B''_t\}_{t\ge 0}$ are independent standard Brownian motions). We can view the index
set $\mathbb{R}$ itself as a group $G = (\mathbb{R}, +)$ under addition. It is now easily seen that
Brownian motion is a stationary Gaussian process: transitivity is obvious,
while translation invariance can be read off from $d(t, s) = \sqrt{|t - s|}$.

Example 6.18 (Random Fourier series). A classical application of stationary
Gaussian processes is to develop an understanding of Fourier series with random
coefficients. Let $g_k$ and $g'_k$ be i.i.d. $N(0,1)$ random variables, and let $c_k$
be coefficients such that $\sum_k c_k^2 < \infty$. Define for $t \in S^1 = [0, 2\pi[$ the process

$$
X_t = \sum_{k=0}^\infty c_k\{g_k \sin kt + g'_k \cos kt\}.
$$

Then $\{X_t\}_{t\in S^1}$ is a stationary Gaussian process for the group of rotations of
the circle $S^1$. Indeed, transitivity is obvious, and it is not difficult to compute
$d(t, s)^2 = 2\sum_k c_k^2\{1 - \cos(k(t - s))\}$, which is evidently translation-invariant.
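The distance formula in Example 6.18 can be confirmed numerically for a truncated series. The sketch below is illustrative only: the coefficient sequence `c` (here $c_k = (k+1)^{-2}$, truncated at 50 terms) and the sample points are arbitrary assumptions. It computes $\mathbf{E}|X_t - X_s|^2$ term by term from the independence of the coefficients and compares it with the closed form, then checks invariance under a common rotation of $t$ and $s$.

```python
import math

def d_squared_direct(t, s, c):
    """E|X_t - X_s|^2 computed from independence of g_k and g'_k: the k-th
    term contributes c_k^2 * [(sin kt - sin ks)^2 + (cos kt - cos ks)^2]."""
    return sum(
        ck * ck * ((math.sin(k * t) - math.sin(k * s)) ** 2
                   + (math.cos(k * t) - math.cos(k * s)) ** 2)
        for k, ck in enumerate(c)
    )

def d_squared_formula(t, s, c):
    """Closed form 2 * sum_k c_k^2 * (1 - cos(k (t - s)))."""
    return 2.0 * sum(ck * ck * (1.0 - math.cos(k * (t - s)))
                     for k, ck in enumerate(c))

c = [1.0 / (k + 1) ** 2 for k in range(50)]  # hypothetical square-summable coefficients
t, s, shift = 0.7, 2.9, 1.3

print(d_squared_direct(t, s, c), d_squared_formula(t, s, c))
# Rotating both points by the same angle leaves the distance unchanged.
print(d_squared_direct(t + shift, s + shift, c))
```

The agreement rests on the identity $(\sin a - \sin b)^2 + (\cos a - \cos b)^2 = 2 - 2\cos(a - b)$, which depends only on the difference $a - b$.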

Under  the  stationarity  assumption,  we  have  seen  that  the  upper  bound 
we  have  used  in  a  single  iteration  of  the  chaining  argument  is  matched  by  an 
essentially  equivalent  lower  bound.  Therefore,  in  this  setting,  we  expect  that 
the  chaining  bound  obtained  in  the  previous  chapter  is  tight.  To  prove  this, 
little  remains  but  to  run  the  chaining  argument  in  reverse. 


Theorem 6.19 (Fernique). Let $\{X_t\}_{t\in T}$ be a stationary separable Gaussian
process. Then we can estimate for some universal constants $c_1, c_2$

$$
c_1\int_0^\infty \sqrt{\log N(T, d, \varepsilon)}\,d\varepsilon \le \mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le c_2\int_0^\infty \sqrt{\log N(T, d, \varepsilon)}\,d\varepsilon.
$$

Proof. As the Gaussian process is stationary, all balls behave in the same way.
Thus we will lighten our notation by defining $B(\varepsilon) = B(t_0, \varepsilon)$ for some fixed
but arbitrary point $t_0 \in T$. This will play the role of our “representative ball”.

Let us begin by applying Theorem 6.14 at the scale $\alpha^n$. Choose $N_n$ to be
a maximal $\alpha^{n+2}$-packing of the ball $B(\alpha^{n+1})$. Then we have

$$
\bigcup_{s\in N_n} B(s, \alpha^{n+3}) \subseteq B(\alpha^n),
$$

as $d(t_0, t) \le d(t_0, s) + d(s, t) \le \alpha^{n+1} + \alpha^{n+3} \le \alpha^n$ for every $s \in N_n$ and
$t \in B(s, \alpha^{n+3})$. This situation is illustrated in the following figure:
By the maximality of the packing $N_n$, the duality between packing and covering
numbers yields $|N_n| \ge N(B(\alpha^{n+1}), d, \alpha^{n+2})$. Thus Theorem 6.14 yields

$$
\mathbf{E}\Big[\sup_{t\in B(\alpha^n)} X_t\Big] \ge c\,\alpha^{n+2}\sqrt{\log N(B(\alpha^{n+1}), d, \alpha^{n+2})} + \mathbf{E}\Big[\sup_{t\in B(\alpha^{n+3})} X_t\Big],
$$

where we have used stationarity and $B(s, \alpha^{n+3}) \subseteq B(\alpha^n)$ to conclude that

$$
\min_{s\in N_n}\mathbf{E}\Big[\sup_{t\in B(\alpha^n)\cap B(s,\alpha^{n+3})} X_t\Big] = \mathbf{E}\Big[\sup_{t\in B(\alpha^{n+3})} X_t\Big]
$$

(the term on the left being the one that arises in Theorem 6.14).

We now iterate this bound. Let $k_0$ be the largest integer such that $\alpha^{k_0} \ge \mathrm{diam}(T)$.
If we start the iteration at any $n \le k_0$, then we obtain

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge c\sum_{k\ge 0}\alpha^{n+3k+2}\sqrt{\log N(B(\alpha^{n+3k+1}), d, \alpha^{n+3k+2})}.
$$

This  completes  the  core  part  of  the  proof  of  Theorem  6.19:  we  have  obtained  a 
multiscale  lower  bound  on  the  supremum  of  the  Gaussian  process  by  “chaining 
in  reverse” .  However,  at  first  sight  the  lower  bound  looks  a  little  different  than 
the  upper  bound  of  Theorem  5.24.  The  difference  proves  to  be  cosmetic,  and 
we  will  presently  “fix”  the  discrepancy  between  the  two  bounds. 

First, note that the terms in the above sum “skip” from scale $\alpha^k$ to $\alpha^{k+3}$,
rather than summing over all $k \in \mathbb{Z}$. As the starting point $n$ is arbitrary,
however, we can fix this by averaging over $n = k_0, k_0 - 1, k_0 - 2$. This yields

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge \frac{c}{3}\sum_{k\in\mathbb{Z}}\alpha^{k+1}\sqrt{\log N(B(\alpha^k), d, \alpha^{k+1})}.
$$

The remaining problem with this lower bound is that it contains covering
numbers of the form $N(B(\alpha^k), d, \alpha^{k+1})$, while our upper bound is phrased in
terms of covering numbers of the entire set $N(T, d, \alpha^{k+1})$. To fix this, let us
do some covering number gymnastics. Suppose we can cover $T$ by $m$ balls of
radius $\alpha^k$, and that each ball of radius $\alpha^k$ can be covered by $m'$ balls of radius
$\alpha^{k+1}$. Then clearly $T$ can be covered by $mm'$ balls of radius $\alpha^{k+1}$. We can
choose $m = N(T, d, \alpha^k)$ and $m' = N(B(\alpha^k), d, \alpha^{k+1})$ (using stationarity to
argue that the covering number of any ball $B(s, \alpha^k)$ is equal to that of our
representative ball $B(\alpha^k)$). A moment’s reflection will show that we proved

$$
N(T, d, \alpha^{k+1}) \le N(T, d, \alpha^k)\,N(B(\alpha^k), d, \alpha^{k+1}).
$$
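The two-stage covering construction behind this inequality can be carried out explicitly on a finite point set. The sketch below is an illustration under simplifying assumptions: it uses a random point cloud in the Euclidean plane rather than the index set of a Gaussian process, and greedy nets rather than minimal covers, so the counts `m` and `m_prime` are upper bounds on covering numbers and not the covering numbers themselves. The names `greedy_net`, `r_coarse` and `r_fine` are hypothetical.

```python
import math
import random

def greedy_net(points, radius):
    """Greedily pick centers from `points` so that every point lies within
    `radius` of some center (the centers form a radius-net of `points`)."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > radius for c in centers):
            centers.append(p)
    return centers

rng = random.Random(1)
T = [(rng.random(), rng.random()) for _ in range(400)]  # toy index set
r_coarse, r_fine = 0.3, 0.1

coarse = greedy_net(T, r_coarse)   # balls of radius r_coarse covering T
fine_counts = []
two_stage = []
for c in coarse:
    ball = [p for p in T if math.dist(p, c) <= r_coarse]
    net = greedy_net(ball, r_fine)  # cover this coarse ball at the fine scale
    fine_counts.append(len(net))
    two_stage.extend(net)

m, m_prime = len(coarse), max(fine_counts)
covered = all(any(math.dist(p, c) <= r_fine for c in two_stage) for p in T)
# The union of the fine nets covers T and uses at most m * m_prime balls.
print(m, m_prime, len(two_stage), covered)
```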

This sort of reasoning is useful in many problems involving covering numbers.
In the present setting, plugging this inequality into the above bound yields

$$
\begin{aligned}
\mathbf{E}\Big[\sup_{t\in T} X_t\Big]
&\ge \frac{c}{3}\sum_{k\in\mathbb{Z}}\alpha^{k+1}\sqrt{\log N(T, d, \alpha^{k+1})} - \frac{c}{3}\sum_{k\in\mathbb{Z}}\alpha^{k+1}\sqrt{\log N(T, d, \alpha^k)} \\
&= \frac{c(1-\alpha)}{3}\sum_{k\in\mathbb{Z}}\alpha^{k+1}\sqrt{\log N(T, d, \alpha^{k+1})} \\
&\ge c'\int_0^\infty \sqrt{\log N(T, d, \varepsilon)}\,d\varepsilon
\end{aligned}
$$

for some universal constant $c'$, where we estimated the sum by an integral
in the usual manner (cf. Problem 5.9). Note that in order to prove that the
two terms in the first inequality are of the same order, we used the fact that
the sum runs over all $k \in \mathbb{Z}$ and not just over multiples of three. This minor
annoyance in the proof therefore does serve a purpose.

We have now proved the lower bound. The corresponding upper bound
follows immediately from the previous chapter (Corollary 5.25). □

Problems 

6.1 (An alternative proof of super-Sudakov). We deduced the super-Sudakov
inequality from the ordinary Sudakov inequality together with Gaussian
concentration. It is also possible, however, to obtain Theorem 6.14 directly
from the Slepian-Fernique inequality by modifying the proof of the Sudakov
inequality. The advantage of this is that it yields somewhat sharper constants.
The aim of this problem is to develop this alternative proof.

For simplicity, let $\{X_t\}_{t\in T}$ be a Gaussian process on a finite index set $T$
(the extension to the case of a separable Gaussian process follows readily as
in the proof of Theorem 5.24). Let $N$ be an $\varepsilon$-packing of $(T, d)$.

a. For every $s \in N$, let $T_s := \{t \in T : d(t, s) \le \frac{1}{4}\varepsilon\}$ and

$$
Z_t = X_t^{(s)} - X_s^{(s)} + \tfrac{1}{4}\varepsilon g_s \quad \text{for } t \in T_s,\ s \in N,
$$

where $\{X_t^{(s)}\}_{t\in T}$ are independent copies of $\{X_t\}_{t\in T}$ and $g_s$ are independent
$N(0,1)$ random variables for $s \in N$. Show that we have

$$
\mathbf{E}|X_t - X_{t'}|^2 \ge \mathbf{E}|Z_t - Z_{t'}|^2 \quad \text{for all } t, t' \in \bigcup_{s\in N} T_s.
$$

b. Conclude from Theorem 6.8 that

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge \mathbf{E}\Big[\max_{s\in N}\Big\{\tfrac{1}{4}\varepsilon g_s + \sup_{t\in T_s}\{X_t^{(s)} - X_s^{(s)}\}\Big\}\Big].
$$

c. Use Jensen’s inequality conditionally on $\{g_s\}_{s\in N}$ to conclude that

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge \frac{\varepsilon}{4}\,\mathbf{E}\Big[\max_{s\in N} g_s\Big] + \min_{s\in N}\mathbf{E}\Big[\sup_{t\in T_s} X_t\Big],
$$

and conclude that Theorem 6.14 holds for $\alpha = \frac{1}{4}$.
6.2 (Rectangles). Consider the Gaussian process $\{X_t\}_{t\in\{-1,1\}^n}$ of the form

$$
X_t = \sum_{k=1}^n g_k t_k a_k,
$$

where $a_1 \ge \cdots \ge a_n > 0$ are given constants and $g_1, \ldots, g_n$ are i.i.d. $N(0,1)$.
Such a process is called a rectangle (as the index set $(\{-1,1\}^n, d)$ has the
same geometry as the corners of a rectangle in $(\mathbb{R}^n, \|\cdot\|)$).

a. Show that

$$
\mathbf{E}\Big[\sup_{t\in\{-1,1\}^n} X_t\Big] = \sqrt{\frac{2}{\pi}}\sum_{k=1}^n a_k.
$$

b. Argue that $\{X_t\}_{t\in\{-1,1\}^n}$ is a stationary Gaussian process, so that

$$
\int_0^\infty \sqrt{\log N(\{-1,1\}^n, d, \varepsilon)}\,d\varepsilon \asymp \sum_{k=1}^n a_k.
$$

c. Attempt to verify this conclusion by estimating covering numbers and computing
the entropy integral directly. (This is surprisingly hard!)

d. Let $a_k = 1/k$. Show that for every $n \ge 1$

$$
\sup_{\varepsilon>0} \varepsilon\sqrt{\log N(\{-1,1\}^n, d, \varepsilon)} \le c \quad\text{and}\quad \sum_{k=1}^n a_k \ge \log n
$$

for some universal constant $c$. Therefore, while the chaining bound of Theorem
5.24 is sharp, Sudakov’s inequality is far from sharp in this example.
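For small $n$ the supremum over the corners of the rectangle can be computed by brute force, which gives a quick check on part a. The following is a minimal sketch under illustrative assumptions (the dimension `n = 8`, the choice $a_k = 1/k$, the seed and the number of Monte Carlo trials are arbitrary). It compares the exhaustive maximum with the candidate closed form $\sum_k a_k |g_k|$, obtained by choosing each sign $t_k$ to match the sign of $g_k$, and then estimates the expectation by simulation.

```python
import itertools
import math
import random

def sup_brute_force(g, a):
    """max over t in {-1,1}^n of sum_k g_k * t_k * a_k, by exhaustive search."""
    return max(
        sum(gk * tk * ak for gk, tk, ak in zip(g, t, a))
        for t in itertools.product((-1, 1), repeat=len(g))
    )

def sup_closed_form(g, a):
    """Candidate closed form: choose t_k = sign(g_k), giving sum_k a_k * |g_k|."""
    return sum(ak * abs(gk) for gk, ak in zip(g, a))

rng = random.Random(0)
n = 8
a = [1.0 / k for k in range(1, n + 1)]
g = [rng.gauss(0.0, 1.0) for _ in range(n)]
print(sup_brute_force(g, a), sup_closed_form(g, a))

# Monte Carlo estimate of E[sup_t X_t] versus sqrt(2/pi) * sum_k a_k.
trials = 20000
estimate = sum(
    sup_closed_form([rng.gauss(0.0, 1.0) for _ in range(n)], a)
    for _ in range(trials)
) / trials
exact = math.sqrt(2.0 / math.pi) * sum(a)
print(estimate, exact)
```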


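6.3 (A nonstationary process). Consider the Gaussian process $\{X_n\}_{n\in\mathbb{N}}$ defined by

$$
X_n = \frac{g_n}{\sqrt{1 + \log n}},
$$

where $\{g_n\}_{n\in\mathbb{N}}$ are i.i.d. $N(0,1)$. This process is most definitely not stationary.

a. Show that

$$
\mathbf{E}\Big[\sup_{n\in\mathbb{N}} X_n\Big] < \infty.
$$

b. Show that

$$
\int_0^\infty \sqrt{\log N(\mathbb{N}, d, \varepsilon)}\,d\varepsilon = \infty,
$$

so the conclusion of Theorem 6.19 can indeed fail in the nonstationary case.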
c. To gain some insight into the problem, compute the quantity

$$
\mathbf{E}\Big[\sup_{n\,:\,d(n,m)\le\varepsilon} X_n\Big]
$$

for different $m \in \mathbb{N}$. Conclude that while one needs $N(\mathbb{N}, d, \varepsilon)$ balls of radius
$\varepsilon$ to cover $\mathbb{N}$ (and $N(\mathbb{N}, d, \varepsilon) \uparrow \infty$ as $\varepsilon \downarrow 0$), the expected supremum of the
Gaussian process over all but one of these balls vanishes. Thus the remainder
terms in our chaining upper and lower bounds are not comparable (in
fact, in this case it is clearly the upper bound that is inefficient).

6.4 (An improved chaining argument). Let $\{X_t\}_{t\in T}$ be a (nonstationary)
Gaussian process. In order to compare the super-Sudakov inequality to the
chaining upper bound, we used Gaussian concentration to reformulate the
upper bound as follows: if $\mathrm{diam}(T) \le \varepsilon$ and $N \subseteq T$ is an $\alpha\varepsilon$-net, then

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le c\,\varepsilon\sqrt{\log|N|} + \max_{s\in N}\mathbf{E}\Big[\sup_{t\in B(s,\alpha\varepsilon)} X_t\Big].
$$

The goal of this problem is to note that chaining using this improved inequality
will in fact yield a slightly improved version of Corollary 5.25:

$$
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le c_1\sup_{t\in T}\int_0^\infty \sqrt{\log N(B(t, c_2\varepsilon), d, \varepsilon)}\,d\varepsilon
$$

for universal constants $c_1, c_2 > 1$.

a. Prove the above inequality.

b. Find an example where this inequality is sharp, but Corollary 5.25 is not.

Hint: let $\mathcal{T}$ be a (not necessarily regular) finite rooted tree with root $t_0 \in \mathcal{T}$
and leaves $T \subseteq \mathcal{T}$. Assume that all leaves have the same depth $n$. For every
leaf $t \in T$, denote by $\pi_0(t), \pi_1(t), \ldots, \pi_n(t)$ the unique path in the tree from
the root $\pi_0(t) = t_0$ to the leaf $\pi_n(t) = t$. Attach to each vertex $s \in \mathcal{T}$ an i.i.d.
$N(0,1)$ random variable $\xi_s$, and define $\{X_t\}_{t\in T}$ as $X_t = \sum_{k=0}^n \beta^k \xi_{\pi_k(t)}$.
Choose $\beta < 1$ and an irregular tree $\mathcal{T}$ carefully to construct the example.

c. Find an example where also the present inequality is not sharp.

Hint: consider Problem 6.3.


6.3  The  majorizing  measure  theorem 

In  the  previous  section  we  developed  the  machinery  needed  to  run  the  chaining 
argument  in  reverse.  However,  our  upper  bound  involved  a  maximum  over  the 
expected supremum of different balls, while our lower bound involved a minimum
over the expected supremum of different balls. In the stationary case,
these quantities are of the same order and we were able to run the chaining
argument  to  its  completion.  In  the  general  case,  however,  the  supremum  over 
different  balls  of  the  same  radius  can  be  of  a  very  different  order  of  magnitude, 
and  thus  our  upper  and  lower  bounds  do  not  match.  To  close  this  gap,  it  will 
be  essential  to  take  the  inhomogeneity  of  the  process  into  account. 

In this section, we will develop our most efficient incarnation of the chaining method that achieves precisely this goal. There are two problems to be
overcome.  First,  we  must  understand  how  to  obtain  matching  upper  and  lower 
bounds  at  the  level  of  a  single  iteration  of  the  chaining  argument.  This  will 
prove  to  be  surprisingly  straightforward:  we  have  already  encountered  most 
of  the  ideas  in  the  previous  section,  and  it  remains  to  note  that  they  can  be 
implemented  more  efficiently.  Next,  we  must  understand  how  to  iterate  these 
inequalities  so  that  we  ultimately  obtain  matching  upper  and  lower  bounds. 
This  will  prove  to  be  the  most  clever  part  of  the  argument,  and  we  will  see 
that  we  must  organize  the  chaining  argument  carefully  in  order  to  retain  the 
duality  between  packing  and  covering  at  different  scales.  The  payoff,  however, 
will  be  a  remarkable  achievement:  a  complete  understanding  of  the  expected 
supremum of a Gaussian process in terms of chaining! With that accomplishment to look forward to, let us proceed to making it happen.

Our  first  step  is  a  seemingly  innocuous  observation.  In  the  super-Sudakov 
inequality of Theorem 6.14, we could choose $N$ to be any $\varepsilon$-packing. If we did not have the remainder term, then the best possible bound would be obtained by choosing a maximal packing, as we did in the Sudakov inequality of Theorem 6.5. However, in the super-Sudakov inequality, this is not necessarily the
best  idea:  if  we  increase  the  size  of  the  packing,  then  evidently  the  size  of  the 
remainder  term  will  decrease,  and  thus  we  could  “miss”  important  parts  of 
the  index  set  that  will  arise  in  a  later  iteration  of  the  chaining  argument.  By 
resisting  the  temptation  to  be  greedy,  we  obtain  an  immediate  improvement 
of  the  super-Sudakov  inequality  without  any  additional  effort. 

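Corollary 6.20 (Super-Sudakov improved). Let $\{X_t\}_{t\in T}$ be a separable Gaussian process and let $N = \{t_1, \ldots, t_r\}$ be an $\varepsilon$-packing of $(T, d)$. Then

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge \min_{\sigma}\max_{k\le r}\Big\{c\varepsilon\sqrt{\log\sigma(k)} + \mathbf{E}\Big[\sup_{t\in B(t_k,\alpha\varepsilon)} X_t\Big]\Big\},
\]

where the minimum is over all permutations $\sigma$ of $\{1, \ldots, r\}$.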

While we have phrased this result as a minimum over permutations for aesthetic reasons, note that it is clear what the optimal permutation is: it is given by $\sigma(k_i) = i$ if we rank the remainder terms in decreasing order

\[
\mathbf{E}\Big[\sup_{t\in B(t_{k_1},\alpha\varepsilon)} X_t\Big] \ge \mathbf{E}\Big[\sup_{t\in B(t_{k_2},\alpha\varepsilon)} X_t\Big] \ge \cdots \ge \mathbf{E}\Big[\sup_{t\in B(t_{k_r},\alpha\varepsilon)} X_t\Big].
\]

Thus the permutation $\sigma$ captures precisely the inhomogeneity of the process: “fatter” balls $B(t_k, \alpha\varepsilon)$ end up with smaller labels $\sigma(k)$.
Proof. Sort the packing $N = \{t_{k_1}, \ldots, t_{k_r}\}$ as indicated above. If we apply Theorem 6.14 to the smaller packing $\{t_{k_1}, \ldots, t_{k_\ell}\}$ only, we evidently obtain

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge c\varepsilon\sqrt{\log \ell} + \mathbf{E}\Big[\sup_{t\in B(t_{k_\ell},\alpha\varepsilon)} X_t\Big]
\]

for any $\ell \le r$. The result follows immediately by optimizing this bound over $\ell$. □
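The claim that sorting the remainder terms in decreasing order gives the optimal permutation can be checked numerically on a small instance. The following Python sketch is an illustration only: the remainder values and the number `c_eps` (standing in for $c\varepsilon$) are made-up inputs, not quantities computed from an actual Gaussian process. It compares the bound obtained from the sorted labels with a brute-force minimum over all permutations.

```python
import math
from itertools import permutations


def bound_for_labels(remainders, labels, c_eps):
    """Evaluate max_k { c_eps * sqrt(log labels[k]) + remainders[k] }."""
    return max(c_eps * math.sqrt(math.log(lab)) + rem
               for rem, lab in zip(remainders, labels))


def sorted_labels(remainders):
    """Give label 1 to the largest remainder, label 2 to the next, and so on."""
    order = sorted(range(len(remainders)), key=lambda k: -remainders[k])
    labels = [0] * len(remainders)
    for rank, k in enumerate(order, start=1):
        labels[k] = rank
    return labels


def brute_force_min(remainders, c_eps):
    """Minimize the bound over all permutations of {1, ..., r}."""
    r = len(remainders)
    return min(bound_for_labels(remainders, perm, c_eps)
               for perm in permutations(range(1, r + 1)))


# Hypothetical remainder terms E[sup over B(t_k, alpha*eps)] and constant c*eps.
remainders = [0.3, 1.7, 0.9, 1.1, 0.2]
c_eps = 0.5

greedy = bound_for_labels(remainders, sorted_labels(remainders), c_eps)
exhaustive = brute_force_min(remainders, c_eps)
print(sorted_labels(remainders), greedy, exhaustive)
```

The brute-force search is only feasible for small $r$; the point of the sorted assignment is that no search is needed.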

It  might  be  unclear  at  this  point  that  we  have  made  significant  progress. 
Indeed,  while  we  now  capture  the  inhomogeneity  of  the  Gaussian  process 
in  the  lower  bound,  we  have  essentially  just  rearranged  our  previous  lower 
bound  without  making  any  fundamental  improvement.  In  particular,  we  are 
still  far  removed  from  our  chaining  upper  bound.  However,  now  that  we  have 
reformulated  our  lower  bound  in  this  illuminating  manner,  it  will  quickly 
become  clear  that  it  is  in  fact  the  upper  bound  that  is  inefficient  and  fails  to 
capture  the  inhomogeneity  of  the  process.  We  will  presently  correct  this. 

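Proposition 6.21 (Super-chaining). Let $\{X_t\}_{t\in T}$ be a separable Gaussian process. If $\mathrm{diam}(T) \le \varepsilon$ and $\{A_1, \ldots, A_r\}$ is a partition of $T$, then

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \min_{\sigma}\max_{k\le r}\Big\{3\varepsilon\{1 + \sqrt{\log\sigma(k)}\} + \mathbf{E}\Big[\sup_{t\in A_k} X_t\Big]\Big\}.
\]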

The  improved  upper  bound  of  Proposition  6.21  captures  the  inhomogeneity 
of  the  Gaussian  process  in  a  completely  analogous  manner  to  the  lower  bound 
of  Corollary  6.20.  To  prove  this  result,  we  must  eliminate  the  inefficiency  in 
the  proof  of  our  previous  upper  bound.  Somewhat  surprisingly,  it  turns  out 
that  this  inefficiency  arises  in  the  very  first  result  we  proved  about  maxima  of 
random  variables:  Lemma  5.1.  The  following  apparently  minor  improvement, 
which  is  proved  using  a  simple  union  bound,  yields  precisely  what  we  need. 

Lemma 6.22. Let $Z_1, \ldots, Z_n$ be $\sigma^2$-subgaussian random variables. Then

\[
\mathbf{E}\Big[\max_{k\le n}\{Z_k - \mathbf{E}[Z_k] - 2\sigma\sqrt{\log k}\}\Big] \le 3\sigma.
\]


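Proof. We can assume without loss of generality that $\mathbf{E}[Z_k] = 0$ for all $k$. Using a union bound and the subgaussian property, we evidently have for $t \ge 0$

\[
\mathbf{P}\Big[\max_{k\le n}\{Z_k - 2\sigma\sqrt{\log k}\} \ge t\Big]
\le \sum_{k=1}^n \mathbf{P}\big[Z_k \ge 2\sigma\sqrt{\log k} + t\big]
\le e^{-t^2/2\sigma^2}\sum_{k=1}^\infty \frac{1}{k^2}
= \frac{\pi^2}{6}\, e^{-t^2/2\sigma^2},
\]

where the second inequality uses $(2\sigma\sqrt{\log k} + t)^2 \ge 4\sigma^2\log k + t^2$. We therefore estimate

\[
\mathbf{E}\Big[\max_{k\le n}\{Z_k - 2\sigma\sqrt{\log k}\}\Big]
\le \int_0^\infty \mathbf{P}\Big[\max_{k\le n}\{Z_k - 2\sigma\sqrt{\log k}\} \ge t\Big]\, dt
\le \frac{\pi^{5/2}}{6\sqrt{2}}\,\sigma.
\]

For simplicity we estimate the ugly constant $\pi^{5/2}/6\sqrt{2} \approx 2.06$ by 3. □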
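As a sanity check on the constant in Lemma 6.22, the Python sketch below evaluates $\pi^{5/2}/6\sqrt{2}$ and estimates the left-hand side of the lemma by simulation. It is an illustration under an extra assumption not made in the lemma: the $Z_k$ are taken to be independent $N(0,\sigma^2)$ variables, which are $\sigma^2$-subgaussian. The sample sizes and the seed are arbitrary choices.

```python
import math
import random


def lemma_constant():
    """(pi^2 / 6) * sqrt(pi / 2), i.e. pi^(5/2) / (6 * sqrt(2))."""
    return math.pi ** 2.5 / (6.0 * math.sqrt(2.0))


def penalized_max_estimate(n, sigma, trials, seed=0):
    """Monte Carlo estimate of E[max_k { Z_k - 2*sigma*sqrt(log k) }]
    for independent Z_k ~ N(0, sigma^2), k = 1, ..., n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) - 2.0 * sigma * math.sqrt(math.log(k))
                     for k in range(1, n + 1))
    return total / trials


constant = lemma_constant()
estimate = penalized_max_estimate(n=200, sigma=1.0, trials=1000)
print(constant, estimate)
```

In this independent Gaussian case the simulated value should sit well below $3\sigma$, since the lemma is a worst-case bound over all subgaussian families.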
We can now complete the proof of Proposition 6.21.

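Proof (Proposition 6.21). Fix any $t_0 \in T$. As $\mathbf{E}[X_{t_0}] = 0$, we can estimate

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big]
= \mathbf{E}\Big[\max_{k\le r}\sup_{t\in A_k}\{X_t - X_{t_0}\}\Big]
= \mathbf{E}\Big[\max_{k\le r}\Big\{2\varepsilon\sqrt{\log k} + \mathbf{E}\Big[\sup_{t\in A_k} X_t\Big] + \{Y_k - 2\varepsilon\sqrt{\log k}\}\Big\}\Big],
\]

where we have defined

\[
Y_k := \sup_{t\in A_k}\{X_t - X_{t_0}\} - \mathbf{E}\Big[\sup_{t\in A_k}\{X_t - X_{t_0}\}\Big].
\]

As $d(t, t_0) \le \mathrm{diam}(T) \le \varepsilon$, the random variables $Y_k$ are $\varepsilon^2$-subgaussian by Lemma 6.15. Thus Lemma 6.22 immediately yields

\[
\mathbf{E}\Big[\max_{k\le r}\{Y_k - 2\varepsilon\sqrt{\log k}\}\Big] \le 3\varepsilon,
\]

and thus we obtain

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \max_{k\le r}\Big\{3\varepsilon\{1 + \sqrt{\log k}\} + \mathbf{E}\Big[\sup_{t\in A_k} X_t\Big]\Big\}.
\]

But note that this result holds for any ordering of $\{A_1, \ldots, A_r\}$. Replacing $A_j$ by $A_{\sigma^{-1}(j)}$ and optimizing over permutations $\sigma$ concludes the proof. □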

Up  to  the  duality  between  packing  and  covering,  we  have  now  essentially 
obtained  matching  upper  and  lower  bounds  in  Corollary  6.20  and  Proposition 
6.21  for  a  single  iteration  of  the  chaining  argument.  We  have  therefore  finally 
reached  a  point  at  which  it  should  no  longer  appear  to  be  a  major  miracle  that 
we can obtain matching upper and lower bounds on the supremum of a Gaussian
process.  However,  these  bounds  will  be  necessarily  more  sophisticated  than 
in  Theorem  5.24,  as  we  must  now  explicitly  keep  track  of  the  inhomogeneity 
of  the  process  in  each  iteration  of  the  chaining  argument.  In  particular,  it 
is  no  longer  sufficient  just  to  choose  any  sequence  of  coverings  of  the  index 
set  T  at  different  scales:  we  must  sort  each  of  the  covers  in  accordance  with 
the permutations $\sigma$ in Corollary 6.20, which should be thought of as ranking the elements of the cover in order of decreasing “fatness”. This requires some
amount  of  bookkeeping,  which  can  be  done  in  different  ways.  The  device  that 
we  will  choose  for  this  purpose,  given  in  the  following  definition,  is  designed 
to  be  as  close  as  possible  to  the  statement  of  Proposition  6.21. 

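Recall that an increasing sequence of partitions $\{\mathcal{A}_n\}_{n\in\mathbb{Z}}$ is a family of partitions $\mathcal{A}_n$ such that every $B \in \mathcal{A}_{n+1}$ is contained in some set $A \in \mathcal{A}_n$. The set of children of a set $A \in \mathcal{A}_n$ is denoted $c(A) := \{B \in \mathcal{A}_{n+1} : B \subseteq A\}$. For any $t \in T$, we denote by $\mathcal{A}_n(t)$ the unique set $A \in \mathcal{A}_n$ that contains $t$.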
Definition 6.23 (Labelled net). A pair $(\mathcal{A}, \ell)$ is called a labelled net if

1. $\mathcal{A} = \{\mathcal{A}_n\}_{n\in\mathbb{Z}}$ is an increasing sequence of partitions of $T$.

2. $\mathrm{diam}(A) \le 2\alpha^n$ for every $A \in \mathcal{A}_n$, $n \in \mathbb{Z}$.

3. $\ell : \mathcal{A} \to \mathbb{N}$ satisfies $\{\ell(B) : B \in c(A)\} = \{1, \ldots, |c(A)|\}$ for all $A \in \mathcal{A}$.


That is, a labelled net is an increasing family of partitions $\mathcal{A}$, together with a labelling $\ell$ that defines an ordering among all elements of each partition that share the same parent. Such a construction is illustrated in the following figure.


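[Figure: a labelled net. The partitions $\mathcal{A}_{k_0}, \mathcal{A}_{k_0+1}, \mathcal{A}_{k_0+2}, \mathcal{A}_{k_0+3}$ of $T$ are drawn as stacked horizontal intervals, with an integer label on each partition element.]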


Each  horizontal  interval  represents  a  partition  of  T,  and  the  numbers  indicate 
an  assignment  of  labels  to  each  partition  element.  The  dotted  lines  indicate 
the children of each partition element. Note that each $t \in T$ defines a vertical slice through this picture. Listing the labels one encounters along this slice from top to bottom gives the sequence $\ell(\mathcal{A}_{k_0}(t)), \ell(\mathcal{A}_{k_0+1}(t)), \ldots$
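To make the three conditions of Definition 6.23 concrete, the following Python sketch checks them for a finite index set. It is a toy illustration under assumptions of our own: $T$ is a finite set of real numbers with $d(s,t) = |s - t|$, only finitely many scales $k_0, k_0+1, \ldots$ are stored, and the function name `is_labelled_net` and the dictionary representation are hypothetical conventions rather than anything defined in the text.

```python
def diam(block):
    """Diameter of a finite set of reals under d(s, t) = |s - t|."""
    return max(block) - min(block)


def is_labelled_net(T, levels, k0, alpha):
    """levels[j] maps each block (a frozenset) of the partition at scale
    k0 + j to its label. Returns True if the stored scales satisfy the
    conditions of a labelled net."""
    for j, level in enumerate(levels):
        blocks = list(level)
        # Each level must be a partition of T.
        if sorted(x for b in blocks for x in b) != sorted(T):
            return False
        # Condition 2: diam(A) <= 2 * alpha^n for A in the n-th partition.
        if any(diam(b) > 2 * alpha ** (k0 + j) for b in blocks):
            return False
        if j == 0:
            continue
        for parent in levels[j - 1]:
            children = [b for b in blocks if b <= parent]
            # Condition 1: the children must exactly cover their parent.
            if set().union(*children) != set(parent):
                return False
            # Condition 3: the children carry the labels 1, ..., |c(A)|.
            if sorted(level[b] for b in children) != list(range(1, len(children) + 1)):
                return False
    return True


T = [0.0, 0.1, 0.5, 0.6, 1.0]
fs = frozenset
good = [
    {fs(T): 1},
    {fs([0.0, 0.1, 0.5, 0.6]): 1, fs([1.0]): 2},
    {fs([0.0, 0.1]): 1, fs([0.5, 0.6]): 2, fs([1.0]): 1},
    {fs([0.0]): 1, fs([0.1]): 2, fs([0.5]): 1, fs([0.6]): 2, fs([1.0]): 1},
]
# Same partitions, but two siblings share the label 1 at the third level.
bad = [dict(level) for level in good]
bad[2][fs([0.5, 0.6])] = 1

print(is_labelled_net(T, good, k0=0, alpha=0.5))
print(is_labelled_net(T, bad, k0=0, alpha=0.5))
```

Note that labels are only compared among siblings: the singleton $\{1.0\}$ legitimately carries label 1 at the third level even though other sets at that level also do.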

We  are  now  ready  to  state  a  form  of  the  ultimate  chaining  bound  for 
Gaussian  processes  due  to  Talagrand. 

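Theorem 6.24 (The majorizing measure theorem). Let $\{X_t\}_{t\in T}$ be a separable Gaussian process. Then we have for universal constants $c_1, c_2, \alpha$

\[
c_1\gamma(T) \le \mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le c_2\gamma(T).
\]

Here we defined

\[
\gamma(T) := \inf_{(\mathcal{A},\ell)}\sup_{t\in T}\sum_{k\in\mathbb{Z}} \alpha^k\sqrt{\log \ell(\mathcal{A}_k(t))},
\]

where the infimum is taken over all labelled nets $(\mathcal{A}, \ell)$.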
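For a fixed labelled net on a finite set, the quantity inside the infimum is a finite computation. The Python sketch below evaluates $\sup_t \sum_k \alpha^k\sqrt{\log \ell(\mathcal{A}_k(t))}$ from the label sequences read off along each vertical slice. The representation (a dictionary `label_paths` from points to label lists starting at scale $k_0$) is a hypothetical convention of ours, and the sketch assumes that all labels beyond the stored scales equal 1, as is the case once the partitions consist of singletons. It evaluates one candidate net only; it does not compute the infimum $\gamma(T)$.

```python
import math


def chaining_functional(label_paths, k0, alpha):
    """sup over t of sum_j alpha^(k0 + j) * sqrt(log label_j(t)),
    where label_paths[t] lists the labels of A_k0(t), A_{k0+1}(t), ..."""
    return max(
        sum(alpha ** (k0 + j) * math.sqrt(math.log(lab))
            for j, lab in enumerate(path))
        for path in label_paths.values()
    )


# Labels along each slice for a toy net on T = {0, 0.1, 0.5, 0.6, 1} (assumed).
label_paths = {
    0.0: [1, 1, 1, 1],
    0.1: [1, 1, 1, 2],
    0.5: [1, 1, 2, 1],
    0.6: [1, 1, 2, 2],
    1.0: [1, 2, 1, 1],
}
value = chaining_functional(label_paths, k0=0, alpha=0.5)
print(value)
```

Here the supremum is attained at the point whose label 2 occurs at the coarsest scale, which illustrates why “fat” pieces should receive small labels early on.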

Let us take a moment to consider what we have achieved. Theorem 6.24 gives matching upper and lower bounds for the expected supremum of a Gaussian process. We can therefore conclude that we have completely understood the magnitude of the supremum of Gaussian processes in terms of chaining! On the other hand, the chaining object that arises in Theorem 6.24 is of a very sophisticated form (necessarily so, as we must account explicitly for the inhomogeneity of the Gaussian process): to find a good bound in this manner we must be able to construct a “good” labelled net. Unlike the covering numbers that arose in Theorem 5.24, which are often easy to estimate, constructing good labelled nets “by hand” in inhomogeneous situations is generally an exceedingly difficult task. It may therefore be unclear at this point that Theorem 6.24 has any practical utility. It turns out that Theorem 6.24 is a powerful tool that makes it possible to prove useful and deep results about the suprema of random processes that do not appear to be readily established by other means. We will encounter some examples of such results in the next section.

Remark 6.25. The bookkeeping in the chaining argument can be done in several different ways. We have chosen the labelled net as the basic object in our development of Theorem 6.24 as its definition is tailored to the application of Proposition 6.21. The name “majorizing measure theorem” refers to a different method of bookkeeping that was used in the original formulation of Theorem 6.24, where the role of the labels $\ell$ is replaced by the definition of a measure on the index set $T$ that assigns larger mass to “fatter” partition elements. This idea will be developed in Problem 6.7 below. Yet another formulation, in terms of admissible nets, dispenses entirely with the need for explicitly labelling partition elements. This idea will be developed in the next section.


Let  us  turn  to  the  proof  of  Theorem  6.24.  We  begin  by  proving  the  upper 
bound,  which  is  an  almost  immediate  consequence  of  Proposition  6.21. 

Proof (Upper bound). As in the proof of Theorem 5.24, it suffices to consider the case that $T$ is a finite set. In the following, we fix a labelled net $(\mathcal{A}, \ell)$, and let $k_0$ be the largest integer such that $\mathcal{A}_{k_0} = \{T\}$. We aim to show that

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le c_2 \sup_{t\in T}\sum_{k>k_0} \alpha^k\sqrt{\log \ell(\mathcal{A}_k(t))}.
\]

Note that if $k_0 = -\infty$, then the right-hand side of this inequality is infinite and the statement is trivial. We may therefore assume that $k_0 > -\infty$.

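The proof is now easily completed. By Proposition 6.21, we have

\[
\mathbf{E}\Big[\sup_{t\in A} X_t\Big] \le \max_{B\in c(A)}\Big\{6\alpha^k\{1 + \sqrt{\log \ell(B)}\} + \mathbf{E}\Big[\sup_{t\in B} X_t\Big]\Big\}
\]

for any $A \in \mathcal{A}_k$. Iterating this inequality $n$ times starting at $k = k_0$ yields

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big]
\le \sup_{t\in T}\Big\{\sum_{k=k_0}^{k_0+n-1} 6\alpha^k\{1 + \sqrt{\log \ell(\mathcal{A}_{k+1}(t))}\} + \mathbf{E}\Big[\sup_{s\in\mathcal{A}_{k_0+n}(t)} X_s\Big]\Big\}
\le \frac{6\alpha^{k_0}}{1-\alpha} + \frac{6}{\alpha}\sup_{t\in T}\sum_{k>k_0} \alpha^k\sqrt{\log \ell(\mathcal{A}_k(t))},
\]

provided that $n$ is chosen sufficiently large. Here we have used that as $T$ is assumed to be finite, the remainder term vanishes uniformly in $t$ for large $n$.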

It remains to eliminate the additive constant. To this end, note that by the definition of $k_0$, there exists $t \in T$ such that $\ell(\mathcal{A}_{k_0+1}(t)) = 2$, so that

\[
\alpha^{k_0+1}\sqrt{\log 2} \le \sup_{t\in T}\sum_{k>k_0} \alpha^k\sqrt{\log \ell(\mathcal{A}_k(t))}.
\]

The proof is now easily completed with $c_2 = 6\alpha^{-1}\{1 + 1/(1-\alpha)\sqrt{\log 2}\}$. □
We now turn to the lower bound. The difficulty here is that the lower bound of Corollary 6.20 requires a packing, while the labelled net is defined in terms of partitions. Of course, the duality between packing and covering will be essential here, but the situation proves to be somewhat more delicate than we have previously encountered. To understand the problem, let us try to apply a naive duality argument to the first chaining iteration. Assume for simplicity that $\mathrm{diam}(T) = \alpha^{k_0}$. To apply the lower bound, we first choose a maximal $\alpha^{k_0+1}$-packing $N_{k_0+1} = \{t_1, \ldots, t_r\}$ of $T$. Then Corollary 6.20 gives

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge \max_{k\le r}\Big\{c'\alpha^{k_0+1}\sqrt{\log\sigma(k)} + \mathbf{E}\Big[\sup_{t\in B(t_k,\alpha^{k_0+2})} X_t\Big]\Big\}
\]

for a suitable choice of $\sigma$. We now define the first nontrivial partition $\mathcal{A}_{k_0+1} = \{A_1, \ldots, A_r\}$ of our labelled net by setting $A_k = \{t \in T : \pi_{k_0+1}(t) = t_k\}$, and define the label $\ell(A_k) = \sigma(k)$. By maximality of the packing, each set $A_k$ has diameter at most $2\alpha^{k_0+1}$ as required. Then Proposition 6.21 gives

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \max_{k\le r}\Big\{c\,\alpha^{k_0+1}\sqrt{\log\sigma(k)} + \mathbf{E}\Big[\sup_{t\in A_k} X_t\Big]\Big\}.
\]

Unfortunately, we are now stuck: while the primary terms in the upper and lower bounds match, the remainder terms are not necessarily comparable. Indeed, in the lower bound, we only see the supremum of the process over small balls $B(t_k, \alpha^{k_0+2})$ centered at each point in the packing, while in the upper bound we have the supremum over every element of a partition of the set. If we attempt to iterate this procedure, we will therefore miss in the lower bound all elements of the partitions $\mathcal{A}_n$ in subsequent stages $n > k_0 + 1$ that are not included in one of the balls $B(t_k, \alpha^{k_0+2})$.

The solution to this problem lies in a clever organization of the duality argument. Rather than choosing any maximal packing $N_{k_0+1}$, we will choose the points in such a way that the expected supremum of the process over each of the balls $B(t_k, \alpha^{k_0+2})$ is maximized. Because of this choice, the expected supremum of any element of a partition at a smaller scale is bounded above by the expected supremum over $B(t_k, \alpha^{k_0+2})$, and we can therefore recover all elements of the labelled net in the lower bound. In the end, the argument is not any more difficult than the naive duality argument: the key to the proof is the insight that one must organize the duality argument at a given scale with subsequent iterations of the chaining argument in mind.

Proof (Lower bound). Define for any subset $A \subseteq T$

\[
G(A) := \mathbf{E}\Big[\sup_{t\in A} X_t\Big].
\]

We can assume that $G(T) < \infty$, as the lower bound is trivial otherwise. This implies that $N(T, d, \varepsilon) < \infty$ for all $\varepsilon > 0$ by Sudakov’s inequality, and thus $\mathrm{diam}(T) < \infty$. Let $k_0$ be the largest integer such that $2\alpha^{k_0} \ge \mathrm{diam}(T)$.
To prove the lower bound, we must construct a labelled net $(\mathcal{A}, \ell)$ so that

\[
G(T) \ge c_1\sum_{k\in\mathbb{Z}} \alpha^k\sqrt{\log \ell(\mathcal{A}_k(t))}
\]

for every $t \in T$. To this end, we first let $\mathcal{A}_k = \{T\}$ for all $k \le k_0$ (with $\ell(T) = 1$). We now construct $\mathcal{A}_k$ for $k > k_0$ iteratively in the following manner.

Suppose $\mathcal{A}_k$ has been constructed. We will construct $\mathcal{A}_{k+1}$ by partitioning every element $A \in \mathcal{A}_k$ into smaller subsets as follows.

1. Choose $t_1 \in A$ so that $G(A \cap B(t_1, \alpha^{k+2}))$ is maximized.

2. Let $A_1 = A \cap B(t_1, \alpha^{k+1})$ and $\ell(A_1) = 1$.

3. Choose $t_2 \in A\backslash A_1$ so that $G(A\backslash A_1 \cap B(t_2, \alpha^{k+2}))$ is maximized.

4. Let $A_2 = A\backslash A_1 \cap B(t_2, \alpha^{k+1})$ and $\ell(A_2) = 2$.

5. Choose $t_3 \in A\backslash(A_1 \cup A_2)$ so that $G(A\backslash(A_1 \cup A_2) \cap B(t_3, \alpha^{k+2}))$ is maximized.

6. . . . etc.
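The steps above can be sketched in Python for a finite set $A$. This is an illustration under assumptions of our own, not part of the proof: the functional $G$ is passed in as a function, and in the example we use the toy process $X_t = t\,g$ with $g \sim N(0,1)$ and $T \subset \mathbb{R}$, whose natural metric is $d(s,t) = |s - t|$ and for which $G(S) = \mathrm{diam}(S)/\sqrt{2\pi}$ in closed form. The names `greedy_partition` and `G_line` are hypothetical, and ties in the maximization are broken by taking the smallest point.

```python
import math


def greedy_partition(A, dist, G, k, alpha):
    """Split the finite set A into labelled pieces following steps 1-6.
    Returns a list of (center, piece, label) triples."""
    remaining = set(A)
    pieces = []
    while remaining:
        def small_ball(s):
            # Ball of radius alpha^(k+2) inside what is left of A.
            return frozenset(t for t in remaining if dist(s, t) <= alpha ** (k + 2))

        # Odd steps: pick the center whose small ball has the largest G.
        center = max(sorted(remaining), key=lambda s: G(small_ball(s)))
        # Even steps: remove the larger ball of radius alpha^(k+1) and label it.
        piece = frozenset(t for t in remaining if dist(center, t) <= alpha ** (k + 1))
        pieces.append((center, piece, len(pieces) + 1))
        remaining -= piece
    return pieces


def G_line(S):
    """E[sup_{t in S} t*g] for g ~ N(0,1): equals diam(S) / sqrt(2*pi)."""
    return (max(S) - min(S)) / math.sqrt(2.0 * math.pi)


A = [0.0, 0.1, 0.2, 0.9, 1.0]
pieces = greedy_partition(A, lambda s, t: abs(s - t), G_line, k=0, alpha=0.5)
for center, piece, label in pieces:
    print(label, center, sorted(piece))
```

In this example the cluster near 0 is “fatter” than the cluster near 1, so it is removed first and receives label 1, and the two chosen centers are more than $\alpha^{k+1}$ apart.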

This  construction  is  illustrated  in  the  following  figure: 


The optimization over the choice of $t_i$ ensures that $G(H) \le G(B(t_i, \alpha^{k+2}))$ for any set $H \subseteq A_i$ that is contained in a ball of radius $\alpha^{k+2}$. This will allow us to control the remainder term in Corollary 6.20. On the other hand, in each stage we remove from the set $A$ a ball $B(t_i, \alpha^{k+1})$ with a larger radius $\alpha^{k+1}$. This ensures that $d(t_i, t_j) > \alpha^{k+1}$ for $i \ne j$, so that $\{t_1, t_2, \ldots\}$ form an $\alpha^{k+1}$-packing of $A$ as is required to apply Corollary 6.20. This also implies that the above construction must terminate after a finite number of steps, as the set $T$ has finite packing numbers (as $N(T, d, \varepsilon) < \infty$ for all $\varepsilon > 0$).

Suppose that the above construction terminates after $r$ steps. Then $\{A_1, \ldots, A_r\}$ must be a partition of $A$, each $A_i$ has a distinct label $\ell(A_i) = i$, and $\mathrm{diam}(A_i) \le 2\alpha^{k+1}$ by construction. By partitioning every $A \in \mathcal{A}_k$ in this manner, we have constructed a labelled partition $\mathcal{A}_{k+1}$ of $T$ that satisfies all the properties required of a labelled net. We now iterate this process to construct $\mathcal{A}_{k+2}, \mathcal{A}_{k+3}$, and so forth, to obtain a labelled net $(\mathcal{A}, \ell)$.

Now consider again $A \in \mathcal{A}_k$ and the partition $\{A_1, \ldots, A_r\}$ and packing $\{t_1, \ldots, t_r\}$ constructed above. As $G(B(t_i, \alpha^{k+2}))$ is decreasing in $i$, we have

\[
G(A) \ge \max_{i\le r}\big\{c\,\alpha^{k+1}\sqrt{\log \ell(A_i)} + G(B(t_i, \alpha^{k+2}))\big\}
\]

by Corollary 6.20. Now note that for any $t \in A_i$, we have $\mathcal{A}_k(t) = A$, $\mathcal{A}_{k+1}(t) = A_i$, $\mathcal{A}_{k+3}(t) \subseteq A_i$, and $\mathrm{diam}(\mathcal{A}_{k+3}(t)) \le 2\alpha^{k+3} \le \alpha^{k+2}$. Thus $G(\mathcal{A}_{k+3}(t)) \le G(B(t_i, \alpha^{k+2}))$ by the maximality property of $t_i$, and we obtain

\[
G(\mathcal{A}_k(t)) \ge c\,\alpha^{k+1}\sqrt{\log \ell(\mathcal{A}_{k+1}(t))} + G(\mathcal{A}_{k+3}(t)).
\]

This inequality holds for every $t \in T$ and $k \ge k_0$ (and trivially for $k < k_0$). As in the proof of Theorem 6.19, this inequality “skips” from scale $\alpha^k$ to $\alpha^{k+3}$, so we can iterate starting at $k = k_0, k_0 - 1, k_0 - 2$ and average these lower bounds to obtain

\[
G(T) \ge \frac{c}{3}\sum_{k\in\mathbb{Z}} \alpha^k\sqrt{\log \ell(\mathcal{A}_k(t))}.
\]

As this holds for every $t \in T$, the proof is complete. □

Remark 6.26. Throughout this section, we have fixed $\alpha$ as defined in Theorem 6.14. All our constructions, including the definition of a labelled net, were stated in terms of this universal constant. However, it should be noted that while $\alpha$ must be sufficiently small to ensure the validity of Theorem 6.14, the precise value of $\alpha$ has no particular significance: in particular, we can replace $\alpha$ by any $\beta < \alpha$ throughout at the expense only of changing the universal constants that appear in Theorem 6.24. In view of Problem 6.1, we
may  therefore  fix  an  arbitrary  value  a  <  j  throughout  this  section. 


Problems 


6.5 (Classical chaining and labelled nets). As the chaining functional $\gamma(T)$ of Theorem 6.24 is equivalent to the supremum of the Gaussian process up to universal constants, any upper bound on the latter must also be an upper bound for $\gamma(T)$ up to a universal constant. This is the case, in particular, for all the chaining bounds that we constructed previously. It is straightforward but instructive, however, to give a direct proof that

\[
\gamma(T) \lesssim \int_0^\infty \sqrt{\log N(T, d, \varepsilon)}\, d\varepsilon
\]

by constructing a simple labelled net that witnesses the upper bound. Similarly, give a direct proof of the improved chaining bound

\[
\gamma(T) \lesssim \sup_{t\in T}\int_0^\infty \sqrt{\log N(B(t, c\varepsilon), d, \varepsilon)}\, d\varepsilon
\]

that was investigated in Problem 6.4 above.

6.6 (A nonstationary process revisited). In Problem 6.3 we considered the decidedly nonstationary Gaussian process $\{X_n\}_{n\in\mathbb{N}}$ defined by

\[
X_n = \frac{g_n}{\sqrt{1 + \log n}},
\]

where $\{g_n\}_{n\in\mathbb{N}}$ are i.i.d. $N(0, 1)$. The expected supremum of this process is finite, but none of the chaining bounds that we obtained previously was able to capture this fact (see Problems 6.3 and 6.4). As Theorem 6.24 is sharp, however, there must exist a labelled net that witnesses the finiteness of $\mathbf{E}[\sup_n X_n]$. Construct such a labelled net explicitly.

Hint: choose partitions of the form $\mathcal{A}_k = \{\{1\}, \{2\}, \ldots, \{n_k\}, \mathbb{N} \cap {]n_k, \infty[}\}$.
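The finiteness of $\mathbf{E}[\sup_n X_n]$ can be made plausible by simulation before attempting the construction. The Python sketch below estimates $\mathbf{E}[\max_{n\le N} X_n]$ for a few truncation levels $N$; the truncation levels, the number of trials, and the seed are arbitrary choices of ours, and a simulation of this kind is of course no substitute for the proof asked for in the problem.

```python
import math
import random


def expected_sup_estimate(n_max, trials, seed=0):
    """Monte Carlo estimate of E[max_{n <= n_max} g_n / sqrt(1 + log n)]
    with g_n i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    scales = [1.0 / math.sqrt(1.0 + math.log(n)) for n in range(1, n_max + 1)]
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) * s for s in scales)
    return total / trials


estimates = {n_max: expected_sup_estimate(n_max, trials=200)
             for n_max in (10, 100, 1000)}
for n_max, value in estimates.items():
    print(n_max, round(value, 3))
```

The estimates should stay of order one as the truncation level grows, in contrast with the $\sqrt{2\log N}$ growth of the maximum of $N$ unscaled standard Gaussians.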

6.7 (Majorizing measures). In the original formulation of Theorem 6.24, the bookkeeping in the chaining argument was not done in terms of labelled nets but rather in terms of “majorizing measures”. The goal of this problem is to develop this alternative formulation of Theorem 6.24.

We begin by proving a discrete version of the majorizing measure bound

\[
\gamma(T) \asymp \inf_{(\mathcal{A},\mu)}\sup_{t\in T}\sum_{k\in\mathbb{Z}} \alpha^k\sqrt{\log\frac{1}{\mu(\mathcal{A}_k(t))}} =: \tilde\gamma(T),
\]

where $\mathcal{A} = \{\mathcal{A}_k\}_{k\in\mathbb{Z}}$ is an increasing sequence of partitions of $T$ such that $\mathrm{diam}(A) \le 2\alpha^n$ for all $A \in \mathcal{A}_n$, and $\mu$ is a probability measure on $T$. The majorizing measure $\mu$ here plays the role of the labels in the definition of $\gamma(T)$: evidently $\mu$ should assign larger mass to “fatter” partition elements.

a. Show that $\gamma(T) \le \tilde\gamma(T)$.

Hint: if $p_1 \ge p_2 \ge \cdots \ge p_r \ge 0$ and $\sum_{i=1}^r p_i = 1$, then $p_i \le 1/i$ for every $i$.

To establish the converse inequality, we must be able to construct a majorizing measure $\mu$ from labels $\ell$. The problem here is that $1/\mu(\mathcal{A}_k(t))$ must be increasing in $k$, while there is no ordering relation between the labels $\ell(\mathcal{A}_k(t))$. The appropriate property is easily engineered, however, by “integrating by parts”.

b. Let $\{b_k\}_{k\in\mathbb{Z}}$ be any sequence such that $b_k = 0$ for all $k$ sufficiently small. Prove the elementary “integration by parts” identity

\[
\sum_{k\in\mathbb{Z}} \alpha^k b_k = (1 - \alpha)\sum_{k\in\mathbb{Z}} \alpha^k B_k, \qquad B_k := \sum_{m\le k} b_m.
\]

c. Conclude that

\[
\gamma(T) \asymp \inf_{(\mathcal{A},\ell)}\sup_{t\in T}\sum_{k\in\mathbb{Z}} \alpha^k\sqrt{\log\prod_{m\le k}\ell(\mathcal{A}_m(t))}.
\]

d. Let $(\mathcal{A}, \ell)$ be a labelled net, and let $k_0$ be the largest integer such that $\mathcal{A}_{k_0} = \{T\}$. Fix an arbitrary $t_A \in A$ for every $A \in \mathcal{A}_n$, $n \in \mathbb{Z}$. Show that

\[
\sum_{A\in\mathcal{A}_k}\prod_{m\le k}\frac{1}{\ell(\mathcal{A}_m(t_A))^2} \le \Big(\frac{\pi^2}{6}\Big)^{k-k_0} \le 2^{k-k_0}.
\]

e. In the setting of the previous part, define the probability measure

\[
\mu \propto \sum_{k>k_0} 2^{-2(k-k_0)}\sum_{A\in\mathcal{A}_k}\delta_{t_A}\prod_{m\le k}\frac{1}{\ell(\mathcal{A}_m(t_A))^2}.
\]

Show that for every $t \in T$ and $k \ge k_0$

\[
\log\frac{1}{\mu(\mathcal{A}_k(t))} \le 2(k - k_0)\log 2 + 2\log\prod_{m\le k}\ell(\mathcal{A}_m(t)).
\]

f. Conclude that $\gamma(T) \gtrsim \tilde\gamma(T)$.

The original formulation of the majorizing measure theorem was in terms of an integral rather than a sum, in analogy to Corollary 5.25:

\[
\gamma(T) \asymp \inf_{\mu}\sup_{t\in T}\int_0^\infty \sqrt{\log\frac{1}{\mu(B(t, \varepsilon))}}\, d\varepsilon =: \bar\gamma(T).
\]

It might seem at first sight that the continuous formulation is simpler, as it does not explicitly involve a choice of partitions. However, in applications of the majorizing measure theorem, the discrete formulation is often easier to use and more natural as it is closer to the underlying chaining mechanism. We will presently prove the continuous formulation as well.

g. Deduce from the discrete majorizing measure bound that $\gamma(T) \gtrsim \bar\gamma(T)$.

The converse inequality is much more difficult, as we must now construct a sequence of partitions which was somehow lost in the continuous formulation of the majorizing measure bound. In fact, we might as well construct an entire labelled net. To this end, let us define for every $A \subseteq T$ the functional

\[
F(A) := \inf_{\mu}\sup_{t\in A}\int_0^{\mathrm{diam}(A)} \sqrt{\log\frac{1}{\mu(B(t, \varepsilon))}}\, d\varepsilon.
\]

It turns out that $F(A)$ behaves very much like $G(A) := \mathbf{E}[\sup_{t\in A} X_t]$.

h.  Suppose  that  a  <  |.  Prove  the  following  “super-Sudakov  inequality”  for 
the functional $F$: if $N$ is an $\varepsilon$-packing of $A \subseteq T$, then

\[
F(A) \ge c\varepsilon\sqrt{\log|N|} + \min_{s\in N} F(A \cap B(s, \alpha\varepsilon)).
\]

Hint: use that if $B_1, \ldots, B_r$ are disjoint, then $\mu(B_i) \le 1/r$ for some $i$.

i. Repeat the proof of Theorem 6.24 to show that $\gamma(T) \lesssim F(T) = \bar\gamma(T)$.
The  majorizing  measure  theorem  developed  in  the  previous  section  completely 
characterizes  the  supremum  of  Gaussian  processes  in  terms  of  chaining.  From 
the  fundamental  viewpoint,  this  provides  us  with  substantial  insight  into  the 
nature  of  Gaussian  processes.  On  the  other  hand,  it  is  far  from  clear  at  this 
point  that  this  is  a  useful  result:  labelled  nets  are  intricate  chaining  objects 
that  are  usually  difficult  to  construct  for  any  given  problem.  In  this  section,  we 
will  develop  some  alternative  formulations  of  the  majorizing  measure  theorem 
and  show  how  they  can  be  used  to  prove  some  highly  nontrivial  results  about 
Gaussian  and  subgaussian  processes.  While  we  only  scratch  the  surface  of 
what  can  be  done  with  this  machinery,  the  results  developed  in  this  section 
give  a  flavor  of  the  manner  in  which  such  machinery  is  applied. 

We begin with a simple but very important extension of Theorem 6.24. In both the upper bound and lower bound of Theorem 6.24, we have used the Gaussian nature of the process $\{X_t\}_{t\in T}$. In the lower bound, of course, we already heavily used the Gaussian property even to prove Sudakov’s inequality at a single scale. In the upper bound, however, we only used Gaussian concentration in Proposition 6.21 to handle the remainder term; the rest of the proof used a simple union bound and did not use any special properties of Gaussians. On the other hand, note that all we will do with the remainder term in Proposition 6.21 is to apply the same result to it again in the next iteration of the chaining argument. If, rather than running our chaining argument one iteration at a time, we were to bound all the links in the chain at once as we did in the proof of Theorem 5.29, then Gaussian concentration is no longer needed in the upper bound. In particular, this implies that the upper bound in Theorem 6.24 only requires that $\{X_t\}_{t\in T}$ is subgaussian!

Theorem 6.27 (Generic chaining). Let $\{X_t\}_{t\in T}$ be a separable subgaussian process on $(T, d)$. Then we have for a universal constant $c$

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le c\,\gamma(T).
\]

Proof. We begin by arguing as in the proof of Theorem 5.29. As usual, it suffices to assume that $T$ is a finite set. Let $(\mathcal{A}, \ell)$ be any labelled net, and let $k_0$ be the largest integer such that $\mathcal{A}_{k_0} = \{T\}$. Choose for every $A \in \mathcal{A}$ an arbitrary point $t_A \in A$, and define $\pi_k(t) := t_{\mathcal{A}_k(t)}$ for every $t \in T$. As $T$ is finite and the diameter of $\mathcal{A}_k(t)$ decreases to zero, we evidently have

\[
X_t - X_{t_0} = \sum_{k>k_0}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\},
\]

where $t_0 = \pi_{k_0}(t)$. This is the usual chaining identity.

Let us define a suitable function $u : \mathcal{A} \to [1, \infty[$ to be chosen later. Then it follows immediately from the subgaussian assumption that

\[
\mathbf{P}\big[X_{\pi_k(t)} - X_{\pi_{k-1}(t)} \ge x\alpha^{k-1}\sqrt{\log u(\mathcal{A}_k(t))}\big] \le u(\mathcal{A}_k(t))^{-x^2/8},
\]

where we have used that $d(\pi_k(t), \pi_{k-1}(t)) \le \mathrm{diam}(\mathcal{A}_{k-1}(t)) \le 2\alpha^{k-1}$ by the definition of a labelled net. We therefore obtain by the union bound that

\[
\mathbf{P}[\Omega_x] := \mathbf{P}\big[\exists\, k > k_0,\ t \in T \text{ s.t. } X_{\pi_k(t)} - X_{\pi_{k-1}(t)} \ge x\alpha^{k-1}\sqrt{\log u(\mathcal{A}_k(t))}\big]
\le \sum_{k>k_0}\sum_{A\in\mathcal{A}_k} u(A)^{-x^2/8},
\]

while we evidently have on the event $\Omega_x^c$

\[
\sup_{t\in T}\{X_t - X_{t_0}\} \le \frac{x}{\alpha}\sup_{t\in T}\sum_{k>k_0} \alpha^k\sqrt{\log u(\mathcal{A}_k(t))}.
\]

This simple computation contains the entire idea behind the generic chaining bound. The challenge is to choose the function $u$ such that the bound on the supremum of the process is as small as possible, while we can still control the probability of the bad events $\Omega_x$ (once we have a good bound on the probabilities, we obtain a bound on the expectation as usual by integration). In view of Theorem 6.24 we would really like to choose $u(A) = \ell(A)$, but this is clearly not a good idea: there are many sets $A \in \mathcal{A}$ with label $\ell(A) = 1$, and thus one cannot control our bound on $\mathbf{P}[\Omega_x]$ in this manner.

To  get  around  this  problem,  note  that  we  have  a  lot  of  freedom  in  how  to 
arrange  a  geometric  sum.  This  idea  is  extremely  useful  in  chaining  arguments. 

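Lemma 6.28. Let $\alpha < 1$ and $u_k \ge 1$ for all $k > k_0$. Then

\[
(1 - \alpha)\sum_{k>k_0} \alpha^k\sqrt{\log U_k} \le \sum_{k>k_0} \alpha^k\sqrt{\log u_k} \quad\text{with}\quad U_k := \prod_{k_0<m\le k} u_m.
\]

Proof. As $U_k = U_{k-1}u_k$ for $k > k_0 + 1$, we can estimate

\[
\sum_{k>k_0} \alpha^k\sqrt{\log U_k}
\le \sum_{k>k_0+1} \alpha^k\sqrt{\log U_{k-1}} + \sum_{k>k_0} \alpha^k\sqrt{\log u_k}
= \alpha\sum_{k>k_0} \alpha^k\sqrt{\log U_k} + \sum_{k>k_0} \alpha^k\sqrt{\log u_k}.
\]

The inequality now follows readily. □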
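Lemma 6.28 is easy to test numerically on finite sequences, for which the same argument applies. The Python sketch below draws random sequences $u_k \ge 1$ and random $\alpha \in (0,1)$ and evaluates both sides; the distribution of the $u_k$, the sequence length, the value of $k_0$, and the seed are arbitrary choices made only for illustration.

```python
import math
import random


def lemma_sides(u, alpha, k0):
    """u[j] plays the role of u_k for k = k0 + 1 + j.
    Returns ((1 - alpha) * sum alpha^k sqrt(log U_k), sum alpha^k sqrt(log u_k))."""
    log_U = 0.0
    lhs = 0.0
    rhs = 0.0
    for j, u_k in enumerate(u):
        k = k0 + 1 + j
        log_U += math.log(u_k)  # log U_k = sum of log u_m over k0 < m <= k
        lhs += alpha ** k * math.sqrt(log_U)
        rhs += alpha ** k * math.sqrt(math.log(u_k))
    return (1.0 - alpha) * lhs, rhs


rng = random.Random(0)
checks = []
for _ in range(200):
    alpha = rng.uniform(0.05, 0.95)
    u = [1.0 + rng.expovariate(0.2) for _ in range(30)]
    checks.append(lemma_sides(u, alpha, k0=-3))

print(all(lhs <= rhs * (1 + 1e-12) + 1e-12 for lhs, rhs in checks))
```

The small tolerance only guards against floating-point rounding; the inequality itself is exact.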

The advantage of this simple reformulation is that $U_k$ is much larger than
$u_k$, while the geometric sum differs by at most a constant factor. To put this
idea to good use, let us define for every $k > k_0$ and $t \in T$

$$u(A_k(t)) = 2^{k-k_0}\prod_{k_0<m\le k} \ell(A_m(t))^2.$$

Then we have on the event $\Omega_x^c$

$$\sup_{t\in T}\{X_t - X_{t_0}\} \le a^{k_0-1}x\sum_{k>0} a^k\sqrt{k\log 2} + \frac{\sqrt{2}\,x}{a(1-a)}\sup_{t\in T}\sum_{k>k_0} a^k\sqrt{\log \ell(A_k(t))} \le c_1 x \sup_{t\in T}\sum_{k\in\mathbb{Z}} a^k\sqrt{\log \ell(A_k(t))}$$

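using Lemma 6.28, where $c_1$ is a constant that depends on $a$ only and where
the second inequality follows as in the upper bound proof of Theorem 6.24.
On the other hand, note that by the definition of a labelled net

$$\sum_{B\in c(A)}\frac{1}{\ell(B)^2} = \sum_{m=1}^{|c(A)|}\frac{1}{m^2} \le 2$$

for every $A \in \mathcal{A}$, so that we can estimate

$$\sum_{A\in\mathcal{A}_k}\prod_{k_0<m\le k}\frac{1}{\ell(A_m(t_A))^2} = \sum_{A\in\mathcal{A}_{k-1}}\sum_{B\in c(A)}\frac{1}{\ell(B)^2}\prod_{k_0<m\le k-1}\frac{1}{\ell(A_m(t_A))^2} \le 2\sum_{A\in\mathcal{A}_{k-1}}\prod_{k_0<m\le k-1}\frac{1}{\ell(A_m(t_A))^2} \le \cdots \le 2^{k-k_0}.$$

We can therefore estimate for every $x \ge 4$

$$\mathbf{P}[\Omega_x] \le \sum_{k>k_0} 2^{-(k-k_0)x^2/8}\sum_{A\in\mathcal{A}_k}\prod_{k_0<m\le k}\frac{1}{\ell(A_m(t_A))^{x^2/4}} \le c_2\,2^{-x^2/8},$$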

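where $c_2$ is a universal constant. We have now finally proved that

$$\mathbf{P}\Big[\sup_{t\in T}\{X_t - X_{t_0}\} > c_1 x\sup_{t\in T}\sum_{k\in\mathbb{Z}} a^k\sqrt{\log\ell(A_k(t))}\Big] \le c_2\,2^{-x^2/8}$$

for $x \ge 4$. Using $\mathbf{E}[Z] \le \int_0^\infty \mathbf{P}[Z > x]\,dx \le 4 + \int_4^\infty \mathbf{P}[Z > x]\,dx$ and optimizing
over all labelled nets $(\mathcal{A}, \ell)$ completes the proof of the Theorem. □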

We now immediately obtain our first nontrivial application of the majorizing
measure theorem. The statement of this result is so simple that one would
expect that there must be an elementary proof; but no other proof is known.

Corollary 6.29 (Subgaussian comparison theorem). Let $\{Y_t\}_{t\in T}$ be a
separable Gaussian process with natural metric $d$, and let $\{X_t\}_{t\in T}$ be a separable
subgaussian process on $(T, d)$. Then for a universal constant $C$

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le C\,\mathbf{E}\Big[\sup_{t\in T} Y_t\Big].$$

Proof. Combine Theorems 6.27 and 6.24. □
Remark 6.30. A comparison theorem of this kind can be very useful in practice.
In many problems, it is possible to explicitly compute the supremum of a
Gaussian  process  by  exploiting  special  properties  of  Gaussians  (e.g.,  rotation 
invariance).  One  can  then  invoke  Corollary  6.29  to  show  that  the  same  bound 
applies  when  the  Gaussian  variables  are  replaced  by  subgaussian  ones,  even 
though  one  cannot  perform  explicit  computations  in  the  general  setting. 

While  Corollary  6.29  is  a  trivial  consequence  of  the  generic  chaining 
method,  most  applications  require  one  to  work  in  a  nontrivial  manner  with 
the  chaining  bounds.  So  far  we  have  taken  care  of  the  bookkeeping  in  the 
chaining  argument  in  terms  of  labelled  nets,  as  this  formulation  arose  in  the 
most  natural  manner  from  the  investigation  of  Gaussian  processes.  A  labelled 
net  is  a  somewhat  unwieldy  object,  however:  not  only  must  one  construct 
increasing  partitions,  but  one  must  also  keep  track  of  labels  along  the  way. 
We  will  presently  develop  an  alternative  way  to  organize  the  generic  chaining 
bounds  that  dispenses  with  the  need  to  keep  track  of  the  labels. 

The basic idea that will be used in the sequel is as follows. In all the chaining
arguments that we have used above, we fixed at each scale the diameter of
the sets $A \in \mathcal{A}_k$ but allowed an arbitrary number of such sets. An alternative
way of organizing the chaining argument is to fix the number of sets in the
partition $\mathcal{A}_k$ but to allow their diameters to vary. As a warm-up exercise, let
us reformulate the simple entropy integral bound from the previous chapter
(Corollary 5.25) in this manner. Recall that the covering number $N(T,d,\varepsilon)$
denotes the smallest number of $\varepsilon$-balls needed to cover $T$. If we define

$$e_n(T) := \inf\{\varepsilon : N(T,d,\varepsilon) < 2^{2^n}\},$$

then the entropy number $e_n(T)$ is the smallest radius $\varepsilon$ for which one can cover
$T$ by less than $2^{2^n}$ $\varepsilon$-balls (the mysterious $2^{2^n}$ will be explained shortly). To
formulate the chaining bound in terms of entropy numbers, note that

$$\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon = \sum_{n\ge0}\int_{e_{n+1}(T)}^{e_n(T)} \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon.$$

Using that $2^{2^n} \le N(T,d,\varepsilon) < 2^{2^{n+1}}$ when $e_{n+1}(T) < \varepsilon < e_n(T)$, we obtain

$$\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon \asymp \sum_{n\ge0} 2^{n/2}\{e_n(T) - e_{n+1}(T)\} \asymp \sum_{n\ge0} 2^{n/2}e_n(T).$$

Thus we obtain a bound in terms of entropy numbers that is entirely equivalent,
up to the constants, to the entropy integral of Corollary 5.25.
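
To make the comparison between entropy numbers and the entropy integral concrete, here is a small Python sketch for the unit interval $T = [0,1]$ with the usual distance. The choice of example, the function names, and the midpoint-rule integration are all assumptions made for illustration; for this $T$ the covering number by closed balls of radius $\varepsilon$ is $\lceil 1/(2\varepsilon)\rceil$.

```python
import math

def covering_number(eps):
    """N([0,1], |.|, eps): a closed eps-ball is an interval of length 2*eps."""
    return max(1, math.ceil(1.0 / (2.0 * eps)))

def entropy_number(n):
    """e_n = inf{eps : N(eps) < 2^(2^n)} for T = [0,1]."""
    m = 2 ** (2 ** n) - 1      # largest number of balls allowed at level n
    return 1.0 / (2.0 * m)     # m balls of radius 1/(2m) just cover [0,1]

def entropy_sum(n_max):
    """Partial sum of 2^(n/2) * e_n for n = 0..n_max."""
    return sum(2 ** (n / 2) * entropy_number(n) for n in range(n_max + 1))

def entropy_integral(steps=20000):
    """Midpoint rule for the integral of sqrt(log N(eps)) over (0, 1/2];
    the integrand vanishes for eps >= 1/2 because then N(eps) = 1."""
    h = 0.5 / steps
    return sum(math.sqrt(math.log(covering_number((i + 0.5) * h))) * h
               for i in range(steps))

print(entropy_sum(5), entropy_integral())
```

The two quantities are not equal, but in this example they agree up to a modest constant factor, which is all that the equivalence above asserts.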

Remark 6.31. Let $\{\beta_n\}$ be an increasing sequence with $\beta_0 = 2$, and define the
$\beta$-entropy numbers $e_n^\beta = \inf\{\varepsilon : N(T,d,\varepsilon) < \beta_n\}$. Then we can estimate

$$\sum_{n\ge0}\sqrt{\log\beta_n}\,\{e_n^\beta - e_{n+1}^\beta\} \le \int_0^\infty\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon \le \sum_{n\ge0}\sqrt{\log\beta_{n+1}}\,\{e_n^\beta - e_{n+1}^\beta\}$$

by arguing as above. In order for the left- and right-hand sides to be comparable,
we must have $\log\beta_{n+1} \lesssim \log\beta_n$, which means that $\log\beta_n$ should increase
at most exponentially. This explains why we chose $\beta_n = 2^{2^n}$ above (of course,
any $a^{b^n}$ for $a, b > 1$ would give equivalent results up to universal constants).

We now develop a formulation of the generic chaining bound along these
lines. The remarkable feature of this formulation is that, somewhat surprisingly,
there is no longer a need to keep track of a label for each partition
element: the labels are “hidden” in the diameters of the partition elements.

Definition 6.32 (Admissible net). An increasing sequence of partitions
$\mathcal{A} = \{\mathcal{A}_n\}_{n\ge0}$ of $T$ is called an admissible net if $|\mathcal{A}_n| < 2^{2^n}$ for every $n \ge 0$.

Theorem 6.33 (Labelled and admissible nets). There exist universal
constants $c_1, c_2$ such that $c_1\gamma'(T) \le \gamma(T) \le c_2\gamma'(T)$. Here we defined

$$\gamma'(T) := \inf_{\mathcal{A}}\sup_{t\in T}\sum_{n\ge0} 2^{n/2}\,\mathrm{diam}(A_n(t)),$$

where the infimum is taken over all admissible nets $\mathcal{A}$.
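
For a finite set $T$, both the admissibility condition and the quantity inside the infimum can be evaluated directly. The following Python sketch does this for a given sequence of partitions; the function names and the toy example are assumptions for illustration, and only finitely many levels are listed (the last level is taken to consist of singletons, so that all later terms vanish).

```python
from itertools import combinations

def diam(block, d):
    """Diameter of a finite set of points under the metric d."""
    return max((d(s, t) for s, t in combinations(block, 2)), default=0.0)

def is_admissible(net, points):
    """Check that net = [A_0, A_1, ...] is an increasing sequence of
    partitions of `points` with |A_n| < 2^(2^n) at every listed level."""
    points = frozenset(points)
    prev = None
    for n, partition in enumerate(net):
        blocks = [frozenset(b) for b in partition]
        covers = frozenset().union(*blocks) == points
        disjoint = sum(len(b) for b in blocks) == len(points)
        if not (all(blocks) and covers and disjoint):
            return False                  # A_n is not a partition of T
        if len(blocks) >= 2 ** (2 ** n):
            return False                  # too many sets at level n
        if prev is not None and not all(any(b <= p for p in prev) for b in blocks):
            return False                  # A_n does not refine A_{n-1}
        prev = blocks
    return True

def chaining_sum(net, points, d):
    """sup over t of sum_n 2^(n/2) * diam(A_n(t)), over the listed levels."""
    worst = 0.0
    for t in points:
        total = sum(2 ** (n / 2) * diam(next(b for b in part if t in b), d)
                    for n, part in enumerate(net))
        worst = max(worst, total)
    return worst

points = range(8)
d = lambda s, t: abs(s - t)
net = [
    [set(range(8))],                      # A_0 = {T}
    [set(range(4)), set(range(4, 8))],    # |A_1| = 2 < 4
    [{i} for i in range(8)],              # |A_2| = 8 < 16
]
print(is_admissible(net, points), chaining_sum(net, points, d))
```

Any admissible net gives an upper bound on $\gamma'(T)$ in this way; computing the infimum itself would require a search over all admissible nets, which this sketch does not attempt.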

To illustrate the idea of the proof, consider the upper bound $\gamma(T) \lesssim \gamma'(T)$.
For any admissible net $\mathcal{A}'$, we must construct a labelled net $(\mathcal{A}, \ell)$ such that

$$\sum_{k\in\mathbb{Z}} a^k\sqrt{\log\ell(A_k(t))} \lesssim \sum_{n\ge0} 2^{n/2}\,\mathrm{diam}(A'_n(t)) \quad\text{for all } t \in T.$$

We can view any increasing sequence of partitions as a partition tree with a
directed edge from $A$ to $B$ if $B \in c(A)$. A cut in the tree is a set of vertices $\mathcal{B}$
such that every branch of the tree contains exactly one element of $\mathcal{B}$. Clearly
any cut of a partition tree is itself a partition. The idea of the proof is to
define each partition $\mathcal{A}_n$ by taking the smallest possible cut in $\mathcal{A}'$ such that
each element of $\mathcal{A}_n$ has diameter at most $2a^n$. Then the above inequality will
follow if we assign labels in order of increasing depth of the elements in the
original tree $\mathcal{A}'$. This construction is illustrated in the following figure.


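[Figure: a partition tree $\mathcal{A}'$ with a cut $\mathcal{A}_k$ whose elements have diameter at most $2a^k$; the elements are numbered by their labels $\ell(A)$ in order of increasing depth.]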

Proof (Upper bound). Let $\mathcal{A}'$ be an admissible net, and define

$$n_k(t) = \inf\{n : \mathrm{diam}(A'_n(t)) \le 2a^k\}$$

for every $k \in \mathbb{Z}$ and $t \in T$ (we may assume that $n_k(t) < \infty$ for every $k, t$, as
otherwise the quantity in the definition of $\gamma'(T)$ will be infinite). Let $k_0$ be
the largest integer such that $\mathrm{diam}(T) \le 2a^{k_0}$, and define $\mathcal{A} = \{\mathcal{A}_k\}_{k\in\mathbb{Z}}$ as

$$\mathcal{A}_k = \{T\} \text{ for } k \le k_0, \qquad \mathcal{A}_k = \{A'_{n_k(t)}(t) : t \in T\} \text{ for } k > k_0.$$

Clearly $\mathcal{A}_k$ defines a cut in $\mathcal{A}'$, and thus $\mathcal{A}$ is an increasing sequence of partitions
as in the definition of a labelled net. We now assign labels such that if
$A_{k-1}(t) = A_{k-1}(t')$ then $\ell(A_k(t)) > \ell(A_k(t'))$ whenever $n_k(t) > n_k(t')$.

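Now note that we can reorganize the sum in the definition of $\gamma'(T)$ as

$$\begin{aligned}
\sum_{n\ge0} 2^{n/2}\,\mathrm{diam}(A'_n(t)) &= \sum_{k>k_0}\ \sum_{n_{k-1}(t)\le n<n_k(t)} 2^{n/2}\,\mathrm{diam}(A'_n(t)) \\
&\ge \sum_{k>k_0} 2a^k \sum_{n_{k-1}(t)\le n<n_k(t)} 2^{n/2} \\
&\ge \sqrt{2}\sum_{k>k_0} a^k\,2^{n_k(t)/2}\,\mathbf{1}_{n_k(t)>n_{k-1}(t)}.
\end{aligned}$$

We now claim that $2^{n_k(t)/2}\,\mathbf{1}_{n_k(t)>n_{k-1}(t)}\sqrt{\log 2} \ge \sqrt{\log\ell(A_k(t))}$. To see this,
note that if $n_k(t) = n_{k-1}(t)$, then $A_k(t)$ is the only child of $A_{k-1}(t)$ and
thus $\ell(A_k(t)) = 1$, while we must have $\ell(A_k(t)) \le |\mathcal{A}'_{n_k(t)}| < 2^{2^{n_k(t)}}$ as the
labels are sorted by increasing depth in $\mathcal{A}'$. Thus we have shown that for every
admissible net $\mathcal{A}'$, there exists a labelled net $(\mathcal{A}, \ell)$ such that

$$\sum_{n\ge0} 2^{n/2}\,\mathrm{diam}(A'_n(t)) \ge \sqrt{\frac{2}{\log 2}}\sum_{k\in\mathbb{Z}} a^k\sqrt{\log\ell(A_k(t))}$$

for all $t \in T$. Taking the supremum over $t$, the infimum over $(\mathcal{A}, \ell)$, and then
the infimum over $\mathcal{A}'$ yields $\gamma(T) \le c_2\gamma'(T)$ with $c_2 = \sqrt{2^{-1}\log 2}$. □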
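
The cut construction used in this proof is easy to carry out on a finite example. The Python sketch below computes, for a given scale $k$, the indices $n_k(t)$ and the resulting partition; the helper names and the toy net on eight points are assumptions for illustration, and the labels are not assigned here.

```python
from itertools import combinations

def diam(block, d):
    """Diameter of a finite set of points under the metric d."""
    return max((d(s, t) for s, t in combinations(block, 2)), default=0.0)

def block_of(t, partition):
    """The element of the partition that contains t."""
    return next(b for b in partition if t in b)

def cut(net, points, d, a, k):
    """Return {A'_{n_k(t)}(t) : t in T}, where
    n_k(t) = min{n : diam(A'_n(t)) <= 2 a^k}.

    Assumes the last listed level of `net` consists of singletons, so the
    minimum is always attained (otherwise `next` raises StopIteration)."""
    blocks = set()
    for t in points:
        n_k = next(n for n, part in enumerate(net)
                   if diam(block_of(t, part), d) <= 2 * a ** k)
        blocks.add(block_of(t, net[n_k]))
    return blocks

points = range(8)
d = lambda s, t: abs(s - t)
net = [
    [frozenset(range(8))],
    [frozenset(range(4)), frozenset(range(4, 8))],
    [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5}), frozenset({6, 7})],
    [frozenset({i}) for i in range(8)],
]
for k in range(-2, 3):
    print(k, sorted(sorted(b) for b in cut(net, points, d, 0.5, k)))
```

In this example several consecutive scales select the same level of the net, which is exactly the situation $n_k(t) = n_{k-1}(t)$ in which the label equals one and contributes nothing to the sum.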

The proof of the lower bound follows along very similar lines: starting from
a labelled net we will choose cuts $\mathcal{A}'_n$ such that $|\mathcal{A}'_n| < 2^{2^n}$.

Proof (Lower bound). This time we start with a labelled net $(\mathcal{A}, \ell)$. Let $k_0$ be
the largest integer such that $\mathcal{A}_{k_0} = \{T\}$, and define the quantity

$$u(A_k(t)) = 4^{k-k_0}\prod_{k_0<m\le k}\ell(A_m(t))^2.$$

Then we have as in the proof of Theorem 6.27 for a universal constant $c$

$$\sup_{t\in T}\sum_{k\in\mathbb{Z}} a^k\sqrt{\log\ell(A_k(t))} \ge c\,\sup_{t\in T}\sum_{k>k_0} a^k\sqrt{\log u(A_k(t))}.$$

We now define a cut in $\mathcal{A}$ by setting

$$k_n(t) = \sup\{k \ge k_0 : u(A_k(t)) < 2^{2^n}\}.$$

Note that $k_n(t) < \infty$ as $u(A_k(t))$ increases to infinity (this is the reason why
we work with the cumulative labels $u(A_k(t))$ rather than the labels $\ell(A_k(t))$).
Thus we can define the increasing sequence of partitions $\mathcal{A}' = \{\mathcal{A}'_n\}_{n\ge0}$ as

$$\mathcal{A}'_n = \{A_{k_n(t)}(t) : t \in T\}.$$


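As $u(A_k(t)) \ge 2^{2^n}$ when $k > k_n(t)$, we can estimate

$$\begin{aligned}
\sum_{k>k_0} a^k\sqrt{\log u(A_k(t))} &= \sum_{n\ge0}\ \sum_{k_n(t)<k\le k_{n+1}(t)} a^k\sqrt{\log u(A_k(t))} \\
&\ge \sqrt{\log 2}\sum_{n\ge0} 2^{n/2}\sum_{k_n(t)<k\le k_{n+1}(t)} a^k \\
&= \frac{a\sqrt{\log 2}}{1-a}\sum_{n\ge0} 2^{n/2}\{a^{k_n(t)} - a^{k_{n+1}(t)}\} \\
&\ge \frac{a\sqrt{\log 2}}{2(1-a)}\Big(1-\frac{1}{\sqrt{2}}\Big)\sum_{n\ge0} 2^{n/2}\,\mathrm{diam}(A'_n(t)),
\end{aligned}$$

where the last step uses $\sum_{n\ge0} 2^{n/2}a^{k_{n+1}(t)} \le 2^{-1/2}\sum_{n\ge0} 2^{n/2}a^{k_n(t)}$ and $\mathrm{diam}(A'_n(t)) \le 2a^{k_n(t)}$.
Thus the only thing that remains to be proved is that $\mathcal{A}'$ is an admissible net.
If this is the case, then taking the supremum over $t$, the infimum over $\mathcal{A}'$, and
then the infimum over $(\mathcal{A}, \ell)$ yields the result $c_1\gamma'(T) \le \gamma(T)$.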

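It therefore remains to show that $|\mathcal{A}'_n| < 2^{2^n}$. To this end, note that by
the definition of a labelled net, every partition element $A_k(t) \in \mathcal{A}_k$ gives rise
to a distinct sequence of labels $\ell(A_{k_0+1}(t)), \ldots, \ell(A_k(t))$. Thus, unless
$\mathcal{A}'_n = \{T\}$ (in which case there is nothing to prove),

$$|\mathcal{A}'_n| \le \sum_{k>k_0}\ \sum_{\ell_{k_0+1},\ldots,\ell_k\in\mathbb{N}} \mathbf{1}_{4^{k-k_0}\prod_{k_0<m\le k}\ell_m^2 < 2^{2^n}} \le 2^{2^n}\sum_{k\ge1} 4^{-k}\sum_{\ell_1,\ldots,\ell_k\in\mathbb{N}}\ \prod_{1\le m\le k}\frac{1}{\ell_m^2} < 2^{2^n},$$

as $\sum_{k\ge1} 4^{-k}\sum_{\ell_1,\ldots,\ell_k}\prod_m \ell_m^{-2} = \sum_{k\ge1}(\pi^2/24)^k \approx 0.7$. □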
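
The numerical constant at the end of this proof can be checked in a few lines. The sketch below (with the illustrative name `label_series`) uses $\sum_{\ell\ge1}\ell^{-2} = \pi^2/6$, so that each level of the sum contributes a factor $\pi^2/24$.

```python
import math

def label_series(terms=60):
    """Partial sum of sum_{k>=1} (pi^2/24)^k; the ratio is (1/4) * pi^2/6."""
    r = math.pi ** 2 / 24
    return sum(r ** k for k in range(1, terms + 1))

# Closed form of the geometric series r / (1 - r), for comparison.
closed_form = (math.pi ** 2 / 24) / (1 - math.pi ** 2 / 24)
print(label_series(), closed_form)
```

Both values are strictly below one, which is what makes the cardinality bound strict.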

While  the  formulation  in  terms  of  admissible  nets  is  entirely  equivalent 
to  the  formulation  in  terms  of  labelled  nets,  the  former  can  often  be  simpler 
to  use  in  applications  as  there  are  no  labels  to  keep  track  of.  To  illustrate  a 
nontrivial  result  that  can  now  readily  be  obtained,  let  us  prove  a  remarkable 
fact about the geometry of Gaussian processes on $\mathbb{R}^n$.

For any subset $T \subseteq \mathbb{R}^n$, let us define the Gaussian width $g(T)$ as

$$g(T) := \mathbf{E}\Big[\sup_{t\in T}\sum_{i=1}^n g_i t_i\Big], \qquad g_1,\ldots,g_n \sim \text{i.i.d. } N(0,1).$$

That is, $g(T)$ is the expected supremum over $T$ of the Gaussian process whose
natural distance is the Euclidean distance. We begin with an easy example.

Lemma 6.34. Let $T = \{t_k : k \ge 2\}$ with $\sup_k \|t_k - s\|\sqrt{\log k} \le a$ for some
$s \in \mathbb{R}^n$. Then $g(T) \le Ca$ for a universal constant $C$.
Proof. As $X_k = \sum_{i=1}^n g_i\{t_{ki} - s_i\} \sim N(0, \|t_k - s\|^2)$, the union bound gives

$$\mathbf{P}\Big[\sup_{k\ge2} X_k \ge x\Big] \le \sum_{k\ge2} e^{-x^2/2\|t_k-s\|^2} \le \sum_{k\ge2} k^{-x^2/2a^2}.$$

For $x \ge 2a$, the right-hand side is $\le C'\,2^{-x^2/2a^2}$ for a universal constant $C'$.
Thus $g(T) \le 2a + C'\int_{2a}^\infty 2^{-x^2/2a^2}\,dx \le Ca$ for a universal constant $C$. □
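
The dimension-free nature of this bound can be observed in a small simulation. The Python sketch below estimates the Gaussian width by Monte Carlo for one particular configuration chosen for illustration (an assumption, not an example from the text): $s = 0$ and each $t_k$ supported on its own coordinate with norm $1/\sqrt{\log k}$, so that $a = 1$.

```python
import math
import random

def width_estimate(n_points=200, samples=2000, seed=0):
    """Monte Carlo estimate of E sup_k g_k / sqrt(log k), k = 2..n_points+1.

    This is g(T) for orthogonal points t_k of norm 1/sqrt(log k) and s = 0,
    so that sup_k ||t_k - s|| sqrt(log k) = 1."""
    rng = random.Random(seed)
    scales = [1.0 / math.sqrt(math.log(k)) for k in range(2, n_points + 2)]
    total = 0.0
    for _ in range(samples):
        # One independent standard Gaussian per coordinate.
        total += max(rng.gauss(0.0, 1.0) * c for c in scales)
    return total / samples

for n_points in (10, 100, 400):
    print(n_points, width_estimate(n_points))
```

The estimates should remain bounded as the number of points grows, in line with the universal constant in the lemma; the simulation is only a plausibility check and carries Monte Carlo error.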

We now make a trivial observation: as the supremum of a linear function
$L(t)$ over $T \subseteq \mathbb{R}^n$ equals the supremum over the closed convex hull $\mathrm{conv}\,T$,
we immediately obtain $g(T) = g(\mathrm{conv}\,T)$ for any set $T$. This implies:

Corollary 6.35. Let $T \subseteq \mathrm{conv}\{t_k : k \ge 2\}$ with $\sup_k \|t_k - s\|\sqrt{\log k} \le a$ for
some $s \in \mathbb{R}^n$. Then $g(T) \le Ca$ for a universal constant $C$.

This easy example gives us a simple geometric principle to control the
Gaussian width: if $T$ is contained in the convex hull of a sequence of points
$t_k \to s$ that converge at rate $a/\sqrt{\log k}$, then its Gaussian width $g(T)$ is controlled
by $a$. However, this sort of principle appears to be completely arbitrary:
we could have started with any example in which we can compute explicitly
the Gaussian width (for example, ellipsoids or squares) and deduce an analogous
geometric principle. The completely unexpected feature of Corollary
6.35, however, is that it admits a sharp converse.

Theorem 6.36. There is a universal constant $K$ such that whenever $g(T) \le Ka$,
there exist $s, \{t_k\}$ with $\sup_k \|t_k - s\|\sqrt{\log k} \le a$ and $T \subseteq \mathrm{conv}\{t_k : k \ge 2\}$.

Combining Corollary 6.35 and Theorem 6.36 immediately yields the following
geometric characterization of the Gaussian width:

$$g(T) \asymp \inf\Big\{\sup_{k\ge2}\|t_k - s\|\sqrt{\log k} : T \subseteq \mathrm{conv}\{t_k : k \ge 2\}\Big\}.$$

This remarkable result appears as a complete mystery at this point. However,
much of the mystery is about to disappear: as we will see presently, Theorem
6.36 is little more than a reformulation of the majorizing measure theorem.
The key idea is that the points $t_k$ are none other than rescaled versions of the
“links” $\pi_n(t) - \pi_{n-1}(t)$ that appear in the chaining argument.

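Proof. By the majorizing measure theorem, there is an admissible net $\mathcal{A}$ with

$$\sum_{n\ge0} 2^{n/2}\,\mathrm{diam}(A_n(t)) \le c\,g(T)$$

for all $t \in T$. Choose for every $A \in \mathcal{A}$ an arbitrary point $t_A \in A$, and define
$\pi_n(t) := t_{A_n(t)}$. Fix also an arbitrary point $s \in T$ and let $\pi_{-1}(t) := s$. Define

$$\theta_n(t) = \frac{2^{n/2}\|\pi_n(t) - \pi_{n-1}(t)\|}{C g(T)}, \qquad x_n(t) = \frac{C g(T)}{2^{n/2}}\,\frac{\pi_n(t) - \pi_{n-1}(t)}{\|\pi_n(t) - \pi_{n-1}(t)\|}$$

for $n \ge 0$. As $\|\pi_n(t) - t\| \le \mathrm{diam}(A_n(t)) \to 0$, we have

$$t = s + \sum_{n\ge0}\{\pi_n(t) - \pi_{n-1}(t)\} = s + \sum_{n\ge0}\theta_n(t)x_n(t).$$

As $g(T) \ge \mathbf{E}[\langle g,t\rangle \vee \langle g,s\rangle] = \tfrac12\mathbf{E}|\langle g,t-s\rangle| = \|t-s\|/\sqrt{2\pi}$ for all $t \in T$,

$$\sum_{n\ge0}\theta_n(t) \le \frac{\sqrt{2\pi}\,g(T)}{C g(T)} + \frac{1}{C g(T)}\sum_{n\ge1} 2^{n/2}\,\mathrm{diam}(A_{n-1}(t)) \le 1$$

if we choose $C = \sqrt{2\pi} + c\sqrt{2}$. Thus

$$T \subseteq \mathrm{conv}\{s + x_n(t) : n \ge 0,\ t \in T\} =: \mathrm{conv}\{z_k : k \ge 1\},$$

where $z_k$ have been sorted such that $\|z_k - s\|$ is nonincreasing.

Now note that $\|x_n(t)\| = C g(T)2^{-n/2}$, while there are at most $|\mathcal{A}_n||\mathcal{A}_{n-1}|$
such terms. We can therefore readily estimate

$$\max\{k : \|z_k - s\| > C g(T)2^{-n/2}\} \le \sum_{k=0}^{n-1} 2^{2^k}2^{2^{k-1}} \le n\,2^{2^n} < 2^{2^{n+1}}.$$

Thus we have for all $n \ge 0$ and $2^{2^{n+1}} \le k < 2^{2^{n+2}}$

$$\|z_k - s\| \le \frac{C g(T)}{2^{n/2}} \le \frac{2\sqrt{\log 2}\,C g(T)}{\sqrt{\log k}} \le \frac{2\sqrt{2}\,C g(T)}{\sqrt{\log(k+1)}}.$$

Setting $t_{k+1} = z_k$ so that $T \subseteq \mathrm{conv}\{t_k : k \ge 2\}$, we can readily choose $K$ such
that $g(T) \le Ka$ implies $\|t_k - s\| \le a/\sqrt{\log k}$ for all $k \ge 2$. □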

We  have  seen  above  several  different  but  closely  related  formulations  of  the 
generic  chaining  bound:  in  terms  of  labelled  nets  (Theorems  6.24  and  6.27),  in 
terms  of  majorizing  measures  (Problem  6.7),  and  in  terms  of  admissible  nets 
(Theorem  6.33).  We  conclude  this  section  by  developing  a  dual  formulation  of 
the generic chaining bound. Besides being of significant interest in its own
right, this very useful result will lead us to isolate along the way a fundamental
idea that underlies many applications of the generic chaining machinery.

Let us begin by motivating why we develop yet another formulation of
the generic chaining. The definition of $\gamma(T)$ involves an infimum over labelled
nets: this means that in order to obtain an upper bound on the supremum
of a given Gaussian process, we only need to exhibit one particular labelled
net for which the quantity in the definition of $\gamma(T)$ is small. In essence, this
is what we have been doing in the previous chapter: it is easy to construct
labelled nets by piecing together $\varepsilon$-nets at different scales, in which case we
recover the entropy integral of Corollary 5.25 (cf. Problem 6.5). However, to
have a sharp understanding of the supremum of a given Gaussian process, we
must also obtain a matching lower bound. It is very difficult to obtain lower
bounds on $\gamma(T)$, as this would require us to argue that the quantity in the
definition of $\gamma(T)$ is large for every possible choice of the labelled net.

One should think of a labelled or admissible net, which defines a covering
of the set $T$ at many different scales, as a multiscale counterpart to the notion
of an $\varepsilon$-net, which defines a covering of the set $T$ at a single scale $\varepsilon$. From this
viewpoint, the majorizing measure theorem states that the expected supremum
of a Gaussian process over $T$ is equivalent up to universal constants to
the smallest “size” (in the $\gamma(T)$-sense) of a multiscale covering of $T$. The classical
duality between packing and covering now suggests an interesting idea: is
there a corresponding multiscale counterpart to the notion of an $\varepsilon$-packing, so
that the supremum of a Gaussian process is equivalent up to universal constants
to the largest “size” of a multiscale packing? This is precisely the idea that will
be developed in the remainder of this section. Such a dual formulation is precisely
what one needs in order to obtain lower bounds on the supremum of a Gaussian process.

It is not difficult to find a good candidate for the notion of multiscale packing.
Recall that there was no mystery in the definition of a labelled net: this
notion was simply designed to obtain the best possible upper bound on the
supremum of a Gaussian process using the super-chaining principle (Proposition
6.21). To obtain a notion of multiscale packing, we apply precisely the
same idea in the opposite direction: we design an object that yields the best
possible lower bound using the super-Sudakov inequality (Theorem 6.14). To
help us with the bookkeeping, let us introduce some useful structures.

Definition 6.37 (Trees). A $T$-tree is a family $\mathcal{T}$ of nonempty subsets of $T$
such that $T \in \mathcal{T}$, and for all $C, C' \in \mathcal{T}$ either $C \cap C' = \varnothing$, $C \subseteq C'$, or $C' \subseteq C$.

The definition of a tree is illustrated in the following figure (the base set
$T$ is duplicated several times to clarify the positions of the elements of $\mathcal{T}$):

[Figure: nested subsets of $T$ forming a $T$-tree, drawn on several copies of the base set $T$.]

It is not difficult to see that a $T$-tree can be thought of as a directed tree in
the graph-theoretic sense. The root of the tree is $T$, and the children of a node
$c(A)$ and the leaves of the tree $l(\mathcal{T})$ are defined by inclusion in the obvious
fashion. For every leaf $A \in l(\mathcal{T})$, we will denote the corresponding branch of
the tree as $A_0 \supseteq A_1 \supseteq \cdots$ (starting at the root $A_0 = T$).
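
For a finite base set, the tree structure described above can be made explicit in code. The following Python sketch checks the defining property of a $T$-tree and recovers children and leaves by inclusion; the function names and the small example family are assumptions for illustration.

```python
def is_tree(family, T):
    """Check Definition 6.37: nonempty subsets of T, containing T itself,
    and any two members are either disjoint or nested."""
    sets = {frozenset(C) for C in family}
    T = frozenset(T)
    if T not in sets or any(not C or not C <= T for C in sets):
        return False
    return all(not (C & D) or C <= D or D <= C for C in sets for D in sets)

def children(A, family):
    """c(A): maximal members of the family strictly contained in A."""
    A = frozenset(A)
    below = [frozenset(C) for C in family if frozenset(C) < A]
    return {C for C in below if not any(C < D for D in below)}

def leaves(family):
    """Members of the family that have no children."""
    return {frozenset(C) for C in family if not children(C, family)}

T = {0, 1, 2, 3, 4}
family = [T, {0, 1}, {2, 3}, {0}, {1}]
print(is_tree(family, T))
print(sorted(sorted(C) for C in children(T, family)))
print(sorted(sorted(C) for C in leaves(family)))
```

Note that in this example the leaves do not cover $T$ (the point 4 lies in no leaf), which is allowed for a general $T$-tree.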

An  increasing  sequence  of  partitions,  such  as  in  the  definition  of  a  labelled 
net,  naturally  defines  a  T-tree  with  the  additional  property  that  its  leaves 
cover  T.  In  contrast,  in  a  multiscale  notion  of  packing,  we  would  like  the 
children of each node in the tree to be well separated. The following notion is
specifically designed in order to apply Theorem 6.14.

Definition 6.38 (Packing tree). A packing tree $(\mathcal{T}, \kappa)$ is a $T$-tree $\mathcal{T}$ together
with a map $\kappa : \mathcal{T} \to \mathbb{Z}$ such that the following holds for every $A \notin l(\mathcal{T})$:

1. For every $C \in c(A)$, there exists $t_C \in T$ such that $C \subseteq B(t_C, a^{\kappa(A)+1})$.

2. $d(t_C, t_{C'}) > a^{\kappa(A)}$ for all $C, C' \in c(A)$, $C \ne C'$.

We can now state a dual form of the majorizing measure theorem (where
we note that the upper bound holds already when $\{X_t\}_{t\in T}$ is subgaussian).

Theorem 6.39 (Dual majorizing measure theorem). Let $\{X_t\}_{t\in T}$ be a
separable Gaussian process. Then we have for universal constants $c_1, c_2$

$$c_1\gamma''(T) \le \mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le c_2\gamma''(T).$$

Here we defined

$$\gamma''(T) = \sup_{(\mathcal{T},\kappa)}\ \inf_{A\in l(\mathcal{T})}\ \sum_{n\ge0} a^{\kappa(A_n)}\sqrt{\log|c(A_n)|},$$

where the supremum is taken over all packing trees $(\mathcal{T}, \kappa)$.
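
For a finite tree, the two packing conditions and the quantity inside the supremum can be evaluated mechanically. The Python sketch below does so under stated assumptions: the map from the definition is passed as a dictionary `kappa`, the centers $t_C$ as a dictionary `center`, balls are taken closed, separation is taken strict, and the sum along a branch runs over the strict ancestors of the leaf. All names and the four-point example are illustrative.

```python
import math

def children(A, family):
    """c(A): maximal members of the family strictly contained in A."""
    below = [C for C in family if C < A]
    return [C for C in below if not any(C < D for D in below)]

def is_packing_tree(family, kappa, center, d, a):
    """Check conditions 1 and 2 of the definition at every non-leaf node."""
    for A in family:
        kids = children(A, family)
        for C in kids:
            if any(d(center[C], x) > a ** (kappa[A] + 1) for x in C):
                return False      # C is not inside B(t_C, a^(kappa(A)+1))
        for C in kids:
            for D in kids:
                if C != D and d(center[C], center[D]) <= a ** kappa[A]:
                    return False  # centers are not a^kappa(A)-separated
    return True

def tree_value(family, kappa, a):
    """inf over leaves of the sum, over strict ancestors A of the leaf,
    of a^kappa(A) * sqrt(log |c(A)|)."""
    def branch(leaf):
        return sum(a ** kappa[A] * math.sqrt(math.log(len(children(A, family))))
                   for A in family if leaf < A)
    return min(branch(A) for A in family if not children(A, family))

d = lambda s, t: abs(s - t)
a = 0.25
T = frozenset({0.0, 0.05, 2.0, 2.05})
L, R = frozenset({0.0, 0.05}), frozenset({2.0, 2.05})
family = [T, L, R] + [frozenset({x}) for x in T]
kappa = {T: 0, L: 3, R: 3}
center = {L: 0.0, R: 2.0}
center.update({frozenset({x}): x for x in T})
print(is_packing_tree(family, kappa, center, d, a), tree_value(family, kappa, a))
```

Any single packing tree yields a lower bound on $\gamma''(T)$ in this way; the supremum over all packing trees is not computed here.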

While we only formally defined the notion of a packing tree here, this is not
the first time that we have encountered this idea: we essentially constructed a
packing tree in the proof of Theorem 6.19. The special feature of the stationary
case is that the packing tree is regular, so that $\gamma''(T)$ can be expressed in terms
of the packing numbers of $T$. Then the equivalence between $\gamma''(T)$ and the
entropy integral follows from the simple duality between packing and covering
numbers at each scale. Theorem 6.39 could be viewed as a generalization of this
idea to the nonstationary setting. This result lies much deeper, however, as
we must now run the duality argument in a multiscale fashion.

We  now  turn  to  the  proof  of  Theorem  6.39.  The  lower  bound  is  easy:  it 
follows  almost  trivially,  by  design,  from  iterating  the  super-Sudakov  inequality. 

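Proof (Lower bound). Given a packing tree $(\mathcal{T}, \kappa)$, we obtain

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \ge c_1 \inf_{A\in l(\mathcal{T})}\sum_{n\ge0} a^{\kappa(A_n)}\sqrt{\log|c(A_n)|}$$

by repeatedly applying Theorem 6.14 starting from the root of the tree (we
do not need to worry about the remainder term at the end of the chaining
argument as this is a lower bound). Now take the supremum over $(\mathcal{T}, \kappa)$. □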

The interesting part of the proof is the upper bound $\gamma(T) \lesssim \gamma''(T)$. It turns
out that we already did almost all the necessary work in the proof of the lower
bound in Theorem 6.24, but this is not at all obvious at the moment. Let us
therefore first give an abstract statement of what we accomplished there.
Definition 6.40 (Growth functional). A map $F : 2^T \to \mathbb{R}_+$ is called a
growth functional if $F(B) \le F(A)$ whenever $B \subseteq A \subseteq T$, and

$$F(A) \ge c\,a^n\sqrt{\log|N|} + \min_{s\in N} F(A \cap B(s, a^{n+1}))$$

whenever $N$ is an $a^n$-packing of $A \subseteq T$ for some $n \in \mathbb{Z}$.

Theorem 6.41 (Partitioning scheme). $\gamma(T) \le K F(T)$ for any growth
functional $F$ (the constant $K$ depends only on $c, a$ in the definition of $F$).

Proof. Repeat verbatim the proof of the lower bound of Theorem 6.24, replacing
the special growth functional $G(A)$ by $F(A)$ throughout. □

The key insight behind Theorem 6.41 is that in order to upper bound $\gamma(T)$
in the proof of the majorizing measure theorem, the only Gaussian property
that we used was the super-Sudakov inequality. Thus we can use the same
proof to bound $\gamma(T)$ by any other object that satisfies the super-Sudakov
inequality. Theorem 6.41 turns out to be perhaps the most important tool
in applications of the majorizing measure theorem: while it is exceedingly
difficult to construct good labelled nets (or even admissible nets) by hand in
any given situation, it is often much more promising to try to guess the form of
a growth functional that captures the geometry of the problem. Thus Theorem
6.41 provides a powerful tool to obtain upper bounds on $\gamma(T)$ in different
problems (supposing, of course, that the easiest entropy integral bounds from
the previous chapter do not suffice). We presently give a simple illustration of
this idea by completing the proof of the upper bound in Theorem 6.39.

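Proof (Upper bound). It suffices by Theorem 6.27 to show that $\gamma(T) \le K\gamma''(T)$
for a universal constant $K$. To this end, we will show that $\gamma''$ is
itself a growth functional, so that the proof is complete by Theorem 6.41.

Fix a set $S \subseteq T$ and an $a^n$-packing $N$ of $S$. Let $\varepsilon > 0$, and choose for
every $s \in N$ a packing tree $(\mathcal{T}_s, \kappa_s)$ of $S \cap B(s, a^{n+1})$ such that

$$\inf_{A\in l(\mathcal{T}_s)}\sum_{m\ge0} a^{\kappa_s(A_m)}\sqrt{\log|c(A_m)|} \ge \min_{s'\in N}\gamma''(S \cap B(s', a^{n+1})) - \varepsilon.$$

Now define a new tree $\mathcal{T} = \{S\} \cup \bigcup_{s\in N}\mathcal{T}_s$, and assign labels $\kappa(A) = \kappa_s(A)$
for $A \in \mathcal{T}_s$ and $\kappa(S) = n$. Then clearly $(\mathcal{T}, \kappa)$ is a packing tree of $S$ and

$$\gamma''(S) \ge \inf_{A\in l(\mathcal{T})}\sum_{m\ge0} a^{\kappa(A_m)}\sqrt{\log|c(A_m)|} \ge a^n\sqrt{\log|N|} + \min_{s\in N}\gamma''(S \cap B(s, a^{n+1})) - \varepsilon.$$

Letting $\varepsilon \downarrow 0$ shows that $\gamma''$ satisfies the super-Sudakov inequality. As $\gamma''$ is
clearly increasing, that is, $\gamma''(A) \le \gamma''(B)$ for $A \subseteq B$, it is a growth functional. □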
Problems


6.8  (Chaining  with  admissible  nets).  The  formulation  of  the  majorizing 
measure  theorem  in  terms  of  admissible  nets  seems  to  be  somewhat  simpler 
than  the  formulation  in  terms  of  labelled  nets,  as  there  are  no  labels  to  keep 
track  of.  In  fact,  from  the  point  of  view  of  the  upper  bounds,  chaining  with 
admissible  nets  is  even  easier  than  using  labelled  nets. 

a. Give a direct proof, along the lines of Theorem 6.27, of the fact that if
$\{X_t\}_{t\in T}$ is a separable subgaussian process on $(T, d)$ then

$$\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \lesssim \gamma'(T).$$

It is in fact also possible to give a direct proof of the lower bound in the majorizing
measure theorem in terms of admissible nets. However, this approach
is less intuitive than the proof in terms of labelled nets, as we lose the natural
symmetry between the upper and lower bounds in the chaining argument.

Let us now consider a less structured variant of the notion of an admissible
net. Call $\mathcal{A} = \{\mathcal{A}_n\}_{n\ge0}$ an admissible family if each $\mathcal{A}_n$ individually is a
partition of $T$ with $|\mathcal{A}_n| < 2^{2^n}$, but where we do not make the assumption
that the sequence of partitions is increasing. Define

$$\gamma_0'(T) := \inf_{\mathcal{A}}\sup_{t\in T}\sum_{n\ge0} 2^{n/2}\,\mathrm{diam}(A_n(t)),$$

with the infimum taken over all admissible families $\mathcal{A}$.

b. Show that $\gamma_0'(T) \le \gamma'(T) \le K\gamma_0'(T)$ for a universal constant $K$.

Hint: given an admissible family $\mathcal{A}$, define an increasing sequence of partitions
$\mathcal{B}$ by letting $\mathcal{B}_n$ be the partition generated by $\mathcal{A}_0, \ldots, \mathcal{A}_n$.

c. Give a direct proof of the upper bound in terms of entropy numbers

$$\gamma'(T) \lesssim \sum_{n\ge0} 2^{n/2}e_n(T)$$

that is equivalent to the simple chaining bound in the previous chapter.

6.9 (Separated trees). A separated tree $(\mathcal{T}, \kappa)$ is a $T$-tree $\mathcal{T}$ together with
a map $\kappa : \mathcal{T} \to \mathbb{Z}$ such that for every $A \notin l(\mathcal{T})$, we have $d(C, C') > a^{\kappa(A)}$ and
$\kappa(C) > \kappa(A)$ for all $C, C' \in c(A)$, $C \ne C'$. Thus a separated tree is a less
structured variant of a packing tree where we have no control of the diameters
of the elements of a separated tree. Nonetheless, we will see that the quantity

$$\gamma_0''(T) = \sup_{(\mathcal{T},\kappa)}\ \inf_{A\in l(\mathcal{T})}\ \sum_{n\ge0} a^{\kappa(A_n)}\sqrt{\log|c(A_n)|},$$

where the supremum is taken here over all separated trees $(\mathcal{T}, \kappa)$, plays an
equivalent role to the quantity $\gamma''(T)$. This is not at all obvious, as we cannot
apply the super-Sudakov inequality without control on the diameter.
a. Show that $\gamma''(T) \lesssim \gamma_0''(T)$.

b. Show that $\gamma_0''(T) \lesssim \gamma(T)$.

Hint: fix a separated tree $(\mathcal{T}, \kappa)$ and labelled net $(\mathcal{A}, \ell)$. Now argue as follows
starting from the root $B_0$ of $\mathcal{T}$: as the children $c(B_0)$ are $a^{\kappa(B_0)}$-separated,
each element of $\mathcal{A}_{\kappa(B_0)+1}$ can intersect at most one element of $c(B_0)$. Thus
we can choose $B_1 \in c(B_0)$ and $A_1 \in \mathcal{A}_{\kappa(B_0)+1}$ with $\ell(A_1) \ge |c(B_0)|$. Now
iterate this procedure to select a full branch $B_0, B_1, \ldots$ of $\mathcal{T}$ and a sequence
$A_{i+1} \in \mathcal{A}_{\kappa(B_i)+1}$ with $\ell(A_{i+1}) \ge |c(B_i)|$. Finally, compare the sums that
appear in the definitions of $\gamma_0''(T)$ and $\gamma(T)$ for this selection.

c. Conclude that $\gamma''(T) \asymp \gamma_0''(T)$.

6.10 (Ultrametrics). A (finite) ultrametric space $(U, d)$ is a (finite) set $U$
together with a metric $d$ on $U$ that satisfies the ultra-triangle inequality

$$d(u, v) \le \max\{d(u, w), d(v, w)\} \quad\text{for all } u, v, w \in U.$$

Ultrametric spaces play an important role in the geometry of metric spaces,
where they play a role analogous to that of Hilbert spaces in functional analysis
(any finite ultrametric space can be isometrically embedded in $\ell_2$; the proof of
this fact is left to the interested reader). They also arise naturally in statistical
physics, computer science, and computational biology.
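
For a finite set, the ultra-triangle inequality can be checked by brute force over all triples. The Python sketch below does this and contrasts a tree-like distance on binary strings with the usual distance on the line; the function names and both examples are assumptions chosen for illustration, and the check is of course no substitute for the proofs requested below.

```python
from itertools import permutations

def is_ultrametric(points, d):
    """Check d(u,v) <= max(d(u,w), d(v,w)) on all triples of distinct points."""
    return all(d(u, v) <= max(d(u, w), d(v, w))
               for u, v, w in permutations(points, 3))

def prefix_distance(u, v):
    """2^(-length of the common prefix) for distinct strings, 0 for equal ones."""
    if u == v:
        return 0.0
    common = 0
    for x, y in zip(u, v):
        if x != y:
            break
        common += 1
    return 2.0 ** (-common)

words = ["000", "001", "010", "011", "100", "101", "110", "111"]
print(is_ultrametric(words, prefix_distance))               # tree-like distance
print(is_ultrametric([0, 1, 2], lambda s, t: abs(s - t)))   # points on a line
```

The first distance is determined by the deepest common ancestor in the binary tree of prefixes, which is the kind of structure that part a. below formalizes; three collinear points already violate the ultra-triangle inequality.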

a. Let $U$ be a finite set and $\mathcal{T}$ be a $U$-tree whose leaves are the singletons $\{u\}$.
Fix $\delta : \mathcal{T} \to \mathbb{R}_+$ so that $\delta(\{u\}) = 0$ and $\delta(C) < \delta(A)$ if $C \in c(A)$, and let

$$d(u,v) = \delta(A(u,v)), \qquad A(u,v) = \bigcap\{C \in \mathcal{T} : C \supseteq \{u,v\}\}.$$

Show that $(U, d)$ is an ultrametric space.

b. Let $(U, d)$ be a finite ultrametric space. Show that there is a tree $\mathcal{T}$ and
assignment $\delta : \mathcal{T} \to \mathbb{R}_+$ as in part a. such that $d(u,v) = \delta(A(u,v))$.

Hint: show that if $(U, d)$ is ultrametric, then balls $B(u,\varepsilon)$ and $B(v,\varepsilon)$ that
do not coincide must be disjoint. Thus $\{B(u,\varepsilon) : u \in U\}$ is a partition.

A finite metric space $(U, d)$ $K$-embeds in an ultrametric space if there is an
ultrametric $d_u$ on $U$ such that $K^{-1}d_u(u,v) \le d(u,v) \le K d_u(u,v)$ for all
$u, v \in U$. This idea proves to be intimately related to Gaussian processes.

c. Prove the following formulation of the majorizing measure theorem: there is
a universal constant $K$ so that for any separable Gaussian process $\{X_t\}_{t\in T}$,
there is a finite subset $U \subseteq T$ that $K$-embeds in an ultrametric space with

$$\mathbf{E}\Big[\sup_{u\in U} X_u\Big] \le \mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le K\,\mathbf{E}\Big[\sup_{u\in U} X_u\Big].$$

Hint:  consider  a  more  structured  notion  of  packing  tree  with  the  additional 
requirement  that  each  A  G  T  has  diameter  <  a .  Use  a  minor  modifica¬ 
tion  of  Theorem  6.41  to  show  that  Theorem  6.39  still  holds  for  the  modified 
packing  tree.  Finally,  use  the  packing  tree  to  define  a  suitable  ultrametric. 
Notes

§6.1.  The  inequalities  of  Slepian-Fernique  and  Sudakov  are  classical  results 
on  Gaussian  processes.  The  approach  starting  from  Gaussian  interpolation 
(Lemma 6.9) is due to Slepian [73]. We follow Chatterjee in using the convenient
approximation of the maximum in the proof of Theorem 6.5 (see [2]). See
[95] for more on applications to random matrices (Problems 6.11 and 6.12).
The convex geometry proof of Problem 6.13, due to Talagrand [51], makes it
possible to extend Sudakov’s inequality to non-Gaussian processes [79].

§6.2.  The  super-Sudakov  inequality  is  due  to  Talagrand  [77].  The  alternative 
proof  of  Problem  6.1  is  taken  from  [53].  Theorem  6.19  is  due  to  Fernique  [35]. 
As  is  noted  in  [49],  the  super-Sudakov  inequality  makes  it  possible  to  give  a 
particularly  transparent  proof  that  is  almost  entirely  analogous  to  that  of  the 
chaining  upper  bound.  Problem  6.4  is  inspired  by  [81]. 

§6.3.  Talagrand’s  majorizing  measure  theorem  is  considered  to  be  notoriously 
difficult,  perhaps  because  the  complicated  chaining  object  that  arises  here 
looks  so  bizarre.  I  have  tried  to  tell  the  story  in  such  a  way  that  the  result  does 
not  appear  as  a  major  miracle,  but  rather  as  the  natural  consequence  of  basic 
properties  of  Gaussian  variables.  In  particular,  it  seems  that  the  symmetry 
between  Corollary  6.20  and  Proposition  6.21  is  the  central  idea  in  the  proof; 
once  this  has  been  understood,  it  should  be  almost  clear  why  the  result  must 
be  true.  The  proof  given  here  and  the  formulation  in  terms  of  labelled  nets  is 
the  one  developed  in  [77,  78];  the  presentation  is  inspired  by  [49,  43]  (I  learned 
the  proof  from  [49]).  Proposition  6.21  appears  in  [81]. 

The original proof of the majorizing measure theorem [75] was very complicated, as everything was formulated directly in terms of continuous majorizing measures which are not well suited to chaining. A good exposition of it can be found in [1]. The most recent formulation in terms of admissible nets (section 6.4) is often simpler to use, but a direct proof of the majorizing measure theorem along these lines [88, 89] is in my opinion more mysterious as the natural symmetry between the upper and lower bounds is lost.

The (continuous) upper bound in the majorizing measure theorem as formulated in Problem 6.7 is much older and is due to Fernique [35]. It can even be developed pathwise as a real analysis lemma, see [8].

§6.4.  The  proof  of  Theorem  6.27  is  based  on  [83],  while  the  proof  of  Theorem 
6.33  is  inspired  by  [86].  The  remaining  results  in  this  chapter  are  taken  from 
[88,  89],  where  an  exhaustive  treatment  of  the  generic  chaining  method  and 
its  applications  is  given  (using  exclusively  the  admissible  net  formulation).  A 
remarkable  application  of  the  connection  with  separated  trees  can  be  found  in 
[23].  The  formulation  in  terms  of  ultrametric  spaces  (Problem  6.10)  is  implicit 
in [75]; see [59] for further developments in this direction.

In the previous chapter, we have developed a detailed understanding of the supremum of a Gaussian process $\{X_t\}_{t\in T}$ by chaining with respect to the natural metric $d(t,s) = \|X_t - X_s\|_2$. While Gaussian processes are important in their own right, in many applications such processes arise only in an indirect manner. Particularly in areas such as statistics and machine learning, the more fundamental object of interest is the empirical process $\{G_n(f)\}_{f\in\mathcal{F}}$ over a class of functions $\mathcal{F}$, defined in terms of an i.i.d. sequence $X_1, X_2, \ldots \sim \mu$ as

\[
G_n(f) := \sqrt{n}\,\{\mu_n f - \mu f\}, \qquad \mu_n f := \frac{1}{n}\sum_{k=1}^n f(X_k).
\]

Understanding the supremum of the empirical process determines the rate of convergence of the law of large numbers uniformly over a class of functions $\mathcal{F}$, and thereby the performance of many types of statistical estimators. Similar problems arise at a fundamental level in the geometry of Banach spaces, in combinatorial set theory, and in many other applications.

That empirical processes are closely related to Gaussian processes is expressed by the following immediate consequence of the multivariate CLT.

Lemma 7.1 (Central limit theorem). For any $f_1, \ldots, f_k \in \mathcal{F}$, we have

\[
(G_n(f_1), \ldots, G_n(f_k)) \Longrightarrow (Z(f_1), \ldots, Z(f_k)) \quad \text{in distribution as } n \to \infty,
\]

where $\{Z(f)\}_{f\in\mathcal{F}}$ is the Gaussian process with $\mathrm{Cov}[Z(f),Z(g)] = \mathrm{Cov}_\mu[f,g]$.

In view of the central limit theorem, we expect that the empirical process $\{G_n(f)\}_{f\in\mathcal{F}}$ should in some sense behave like the Gaussian process $\{Z(f)\}_{f\in\mathcal{F}}$ when $n$ is sufficiently large. In particular, as the natural metric for the Gaussian process is given by $d(f,g) = \mathrm{Var}_\mu[f-g]^{1/2}$, we might hope to control the supremum of the empirical process by chaining with respect to $d$. Of course, the empirical process is not actually Gaussian for finite $n$, but the Azuma-Hoeffding inequality (Lemma 3.6) ensures that the empirical process is subgaussian with respect to the metric $d_\infty(f,g) = \|f-g\|_\infty$. We can therefore directly control the supremum of the empirical process by chaining with respect to the uniform metric (indeed, we have already seen this approach in action in Example 5.28!). The problem with this approach is that the uniform metric can be much larger than the $L^2(\mu)$-metric $d$ in many cases, so that we can incur an enormous loss of efficiency in controlling the empirical process as compared to the limiting Gaussian process. Let us give a simple illustration of a setting where this issue arises in a dramatic fashion.

Example 7.2. Let $X_1, X_2, \ldots$ be an i.i.d. sequence of real-valued random variables with distribution $\mu$. By the law of large numbers, the empirical distribution function $F_n(x) = \mu_n(]-\infty,x])$ converges a.s. to the distribution function $F(x) = \mu(]-\infty,x])$ for every $x \in \mathbb{R}$. However, Glivenko and Cantelli proved already in 1933 that the convergence is even uniform in $x$:

\[
\|F_n - F\|_\infty \to 0 \quad \text{a.s.}
\]

To understand this phenomenon (as well as the rate of convergence at which this happens), we must understand the supremum of the empirical process

\[
\sup_{f\in\mathcal{F}} |\mu_n f - \mu f| = \frac{1}{\sqrt{n}} \sup_{f\in\mathcal{F}} |G_n(f)|
\]

over the class of indicators $\mathcal{F} = \{1_{]-\infty,x]} : x \in \mathbb{R}\}$. Now note that

\[
\|1_{]-\infty,x]} - 1_{]-\infty,x']}\|_\infty = 1 \quad \text{whenever } x \ne x'.
\]

Thus evidently $N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) = \infty$ for every $\varepsilon < 1$! In particular, we see that no chaining argument with respect to the uniform metric can ever capture the uniform convergence of the empirical process over the class $\mathcal{F}$, or for that matter over any other infinite class of (indicators of) sets. On the other hand, it is not difficult to see that the covering numbers $N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon)$ are small, and thus the Gaussian process $\{Z(f)\}_{f\in\mathcal{F}}$ is easily controlled by chaining.
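
The uniform convergence in this example is easy to observe numerically. The following is only a minimal sketch under illustrative assumptions: we take the distribution to be uniform on [0,1] (so that F(x) = x), and the sample sizes, number of repetitions, and seed are arbitrary choices. It estimates the expected uniform distance between the empirical and the true distribution function, and rescales it by the square root of n.

```python
import math
import random

def ks_distance_uniform(sample):
    """sup_x |F_n(x) - F(x)| when F is the Uniform[0,1] distribution function."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained at a jump of F_n, so it suffices to compare
    # F(x_(i)) = x_(i) with the values i/n and (i+1)/n of F_n around that jump.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def mean_ks(n, trials, rng):
    """Monte Carlo estimate of E sup_x |F_n(x) - F(x)|."""
    return sum(ks_distance_uniform([rng.random() for _ in range(n)])
               for _ in range(trials)) / trials

rng = random.Random(0)  # fixed seed, an arbitrary choice
results = {n: mean_ks(n, 200, rng) for n in (100, 400, 1600)}
for n, d in results.items():
    # The second printed number is the rescaled distance sqrt(n) * d.
    print(n, round(d, 4), round(math.sqrt(n) * d, 3))
```

In this sketch the distance shrinks as n grows while the rescaled quantity stays of order one, which is the behavior one expects if the supremum of the empirical process is controlled like that of its Gaussian limit.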

It  should  be  evident  from  the  above  discussion  that  a  direct  application 
of  the  methods  that  we  developed  so  far  to  control  the  suprema  of  random 
processes  fails  to  capture  the  behavior  of  empirical  processes.  In  order  to 
obtain  better  control  of  empirical  processes,  we  must  understand  in  what 
sense  the  behavior  of  such  processes  is  similar  to  that  of  the  Gaussian  limit. 
In  this  chapter,  we  will  develop  methods  to  “bring  out  the  Gaussian  nature” 
of  empirical  processes  and  to  control  the  resulting  inequalities. 


7.1  The  symmetrization  method 

One of the most fundamental approaches to bringing out the Gaussian nature of empirical processes is through the method of symmetrization. To understand the idea behind this method, let us begin with a (very) informal discussion of “why the central limit theorem works.”

Let us fix a bounded function $f$, and consider the sum $\sum_{k=1}^n \{f(X_k) - \mu f\}$. As this sum contains $n$ terms of order 1 each, this quantity could be as large as $\sim n$ in the worst case. However, the typical situation is quite different: the central limit theorem states that the sum is only of order $\sqrt{n}$ in probability! Of course, the reason for this is clear. In order for the sum to be of order $n$, most of the terms in the sum must have the same sign so that their contributions add up. But as the terms in the sum are independent and centered, they are highly unlikely to all be of the same sign; to the contrary, there will typically be many terms of opposite sign, so that most of the terms in the sum cancel rather than adding up. This cancellation between terms of different sign accounts for the major reduction in scale from $O(n)$ to only $O(\sqrt{n})$.

The  cancellation  of  terms  of  different  signs  proves  to  be  the  key  mechanism 
of  the  central  limit  theorem:  it  is  the  aggregate  effect  of  random  signs  that 
leads to Gaussian behavior. The remaining features of the distribution $\mu$ are
only  relevant  to  the  limiting  behavior  to  the  extent  that  they  determine  the 
scale  of  the  Gaussian  (i.e.,  its  variance).  This  suggests  that  in  order  to  bring 
out  the  Gaussian  nature  of  the  empirical  process,  we  should  somehow  isolate 
the  random  signs  in  such  a  manner  that  we  can  apply  the  machinery  developed 
in  the  previous  chapters  only  to  the  “Gaussian  part”  of  the  empirical  process. 
The  method  of  symmetrization  achieves  precisely  this  aim. 

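Lemma 7.3 (Symmetrization). Let $X_1, \ldots, X_n$ be i.i.d. random variables in $\mathbb{X}$ with distribution $\mu$, and let $\mathcal{F}$ be a class of functions on $\mathbb{X}$. Then

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\sum_{k=1}^n \{f(X_k) - \mu f\}\bigg|\bigg]
\le \mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\sum_{k=1}^n \varepsilon_k \{f(X_k) - f(Y_k)\}\bigg|\bigg]
\le 2\,\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\sum_{k=1}^n \{f(X_k) - \mu f\}\bigg|\bigg],
\]

where $Y_1, \ldots, Y_n$ is an independent copy of $X_1, \ldots, X_n$, and $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. symmetric Bernoulli random variables independent of $X,Y$.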

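Proof. As $\mu f = \mathbf{E}[f(Y_k) \mid X_1, \ldots, X_n]$, Jensen’s inequality yields

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\sum_{k=1}^n \{f(X_k) - \mu f\}\bigg|\bigg]
\le \mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\sum_{k=1}^n \{f(X_k) - f(Y_k)\}\bigg|\bigg].
\]

But note that $f(X_k) - f(Y_k)$, being a symmetric random variable (hence the name symmetrization!), has the same law as $\varepsilon_k\{f(X_k) - f(Y_k)\}$; this holds jointly over $k \le n$ and $f \in \mathcal{F}$, as exchanging $X_k$ and $Y_k$ leaves the joint law of $X,Y$ unchanged. This implies

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\sum_{k=1}^n \{f(X_k) - f(Y_k)\}\bigg|\bigg]
= \mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\sum_{k=1}^n \varepsilon_k \{f(X_k) - f(Y_k)\}\bigg|\bigg],
\]

which proves the first inequality. The second inequality follows readily using $f(X_k) - f(Y_k) = f(X_k) - \mu f - \{f(Y_k) - \mu f\}$ and the triangle inequality. □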
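
The two inequalities of Lemma 7.3 can be checked exactly in a toy situation where all expectations are finite sums. The sketch below is only an illustration under assumptions that are not taken from the text: the variables are uniform on the three-point set {0, 1, 2}, n = 2, the class consists of three arbitrarily chosen functions, and absolute values are taken inside the suprema (as is needed for the triangle inequality step of the proof).

```python
from itertools import product

# Toy setting (all choices are illustrative assumptions):
# X_k uniform on {0, 1, 2}, n = 2, and a class F of three functions on
# {0, 1, 2}, each stored as its table of values (f(0), f(1), f(2)).
POINTS = (0, 1, 2)
N = 2
F = [(0.0, 1.0, 4.0), (1.0, 0.0, 0.0), (2.0, -1.0, 0.5)]

def mu(f):
    """mu f for the uniform distribution on POINTS."""
    return sum(f) / len(f)

def expect(values):
    """Average over equally likely outcomes, i.e. an exact expectation."""
    values = list(values)
    return sum(values) / len(values)

# E sup_f | sum_k {f(X_k) - mu f} |
lhs = expect(
    max(abs(sum(f[x] - mu(f) for x in xs)) for f in F)
    for xs in product(POINTS, repeat=N)
)

# E sup_f | sum_k eps_k {f(X_k) - f(Y_k)} |
mid = expect(
    max(abs(sum(e * (f[x] - f[y]) for e, x, y in zip(eps, xs, ys))) for f in F)
    for xs in product(POINTS, repeat=N)
    for ys in product(POINTS, repeat=N)
    for eps in product((-1, 1), repeat=N)
)

print(lhs, mid, 2 * lhs)
```

The three printed numbers should appear in nondecreasing order, in agreement with the lemma.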
Let us consider what we have achieved. Define the process

\[
Z_n(f) := \frac{1}{\sqrt{n}} \sum_{k=1}^n \varepsilon_k \{f(X_k) - f(Y_k)\}.
\]

At first sight, this process seems no more useful than the empirical process $G_n(f)$: the best we can do is still to apply the Azuma-Hoeffding inequality, which shows that $\{Z_n(f)\}_{f\in\mathcal{F}}$ is subgaussian with respect to the uniform norm. However, this is not the right way to bound the supremum of $Z_n(f)$. What we have accomplished here is to isolate the behavior of the signs: the random signs $\varepsilon_k$ are independent of the remaining randomness in the problem. We should therefore apply our machinery conditionally on $X,Y$, so that only the “Gaussian part” of the process remains. If we apply the Azuma-Hoeffding inequality conditionally on $X,Y$, we find that the process $\{Z_n(f)\}_{f\in\mathcal{F}}$ is subgaussian with respect to the random metric $d_n$ on $\mathcal{F}$ defined by

\[
d_n(f,g) := \bigg[\frac{1}{n} \sum_{k=1}^n \{f(X_k) - g(X_k) - f(Y_k) + g(Y_k)\}^2\bigg]^{1/2}.
\]

To interpret this metric, note that by the law of large numbers

\[
\lim_{n\to\infty} d_n(f,g) = \mathbf{E}[\{f(X_1) - g(X_1) - f(Y_1) + g(Y_1)\}^2]^{1/2} = \sqrt{2}\,\mathrm{Var}_\mu[f-g]^{1/2},
\]

which is none other (up to a constant factor) than the natural metric $d$ for the limiting Gaussian process $Z(f)$! Thus the symmetrization method isolates precisely in what sense the empirical process approximates the Gaussian process $Z(f)$: by Lemma 7.3, controlling the supremum of the empirical process $G_n(f)$ is equivalent to controlling the supremum of a process that is subgaussian for an empirical approximation to the natural metric of $Z(f)$.
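
As a quick sanity check of this limit, one can compute the random metric on simulated data. The sketch below is only an illustration under assumed choices that do not come from the text: the distribution is uniform on [0,1], the two test functions are f(x) = x and g(x) = x^2, and the sample size and seed are arbitrary. For these choices the variance of f - g equals 1/180, so the limit should be the square root of 2/180.

```python
import math
import random

def d_n(f, g, xs, ys):
    """Random metric d_n(f, g) built from the samples X_1..X_n and Y_1..Y_n."""
    n = len(xs)
    total = sum((f(x) - g(x) - f(y) + g(y)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(total / n)

rng = random.Random(1)                  # fixed seed, an arbitrary choice
n = 200_000
xs = [rng.random() for _ in range(n)]   # X_k ~ Uniform[0, 1] (illustrative)
ys = [rng.random() for _ in range(n)]   # independent copy Y_k

f = lambda x: x        # hypothetical test functions
g = lambda x: x * x

# For h(x) = x - x^2 under Uniform[0,1]:
# E[h] = 1/2 - 1/3 = 1/6 and E[h^2] = 1/3 - 1/2 + 1/5 = 1/30,
# so Var[h] = 1/30 - 1/36 = 1/180.
limit = math.sqrt(2.0 / 180.0)
empirical = d_n(f, g, xs, ys)
print(empirical, limit)
```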

Once the symmetrization argument has been understood, we can apply all the machinery developed in the previous chapters conditionally on $X,Y$. For example, applying Corollary 5.25 conditionally yields

\[
\mathbf{E}\Big[\sup_{f\in\mathcal{F}} |G_n(f)|\Big] \lesssim \mathbf{E}\bigg[\int_0^\infty \sqrt{\log N(\mathcal{F}, d_n, \varepsilon)}\,d\varepsilon\bigg].
\]

This is a vast improvement over the analogous bound with $N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$ that would be obtained by a direct application of Azuma-Hoeffding to $G_n(f)$. At the same time, the fact that we have to deal with a random metric $d_n$ is a nontrivial complication: to control the covering numbers $N(\mathcal{F}, d_n, \varepsilon)$ we must understand the random geometry of the metric space $(\mathcal{F}, d_n)$. In the following sections we will develop some tools to deal with this problem.

So far there has been no loss in our estimates except universal constants: Lemma 7.3 has matching upper and lower bounds. In many applications of symmetrization, however, the following bounds prove to be convenient.

Lemma 7.4 (Symmetrization II). Let $X_1, \ldots, X_n$ be i.i.d. random variables in $\mathbb{X}$ with distribution $\mu$, and let $\mathcal{F}$ be a class of functions on $\mathbb{X}$. Then

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \{f(X_k) - \mu f\}\bigg]
\le 2\,\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \varepsilon_k f(X_k)\bigg]
\le \sqrt{2\pi}\,\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n g_k f(X_k)\bigg],
\]

where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. symmetric Bernoulli random variables and $g_1, \ldots, g_n$ are i.i.d. $N(0,1)$ random variables independent of $X$.


Remark 7.5. The symmetrization method has its origin in functional analysis, where symmetric Bernoulli random variables are often referred to as Rademacher variables. Thus the first inequality is called Rademacher symmetrization, while the second inequality is called Gaussian symmetrization.


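Proof. It follows exactly as in the proof of Lemma 7.3 that

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \{f(X_k) - \mu f\}\bigg]
\le \mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \varepsilon_k \{f(X_k) - f(Y_k)\}\bigg].
\]

Splitting the supremum yields

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \{f(X_k) - \mu f\}\bigg]
\le \mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \varepsilon_k f(X_k)\bigg]
+ \mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n (-\varepsilon_k) f(Y_k)\bigg].
\]

As $(\varepsilon_k, X_k)$ has the same distribution as $(-\varepsilon_k, Y_k)$, the first inequality follows. For the second inequality, as $\mathbf{E}[|g_k| \mid \varepsilon_1, \ldots, \varepsilon_n, X_1, \ldots, X_n] = \sqrt{2/\pi}$, we have

\[
\sqrt{\frac{2}{\pi}}\,\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \varepsilon_k f(X_k)\bigg]
= \mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \mathbf{E}\bigg[\sum_{k=1}^n \varepsilon_k |g_k| f(X_k) \,\bigg|\, \varepsilon, X\bigg]\bigg]
\le \mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \varepsilon_k |g_k| f(X_k)\bigg].
\]

But as $g_k$ is symmetric, $\varepsilon_k |g_k|$ has the same law as $g_k$, and we are done. □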


Lemma 7.4 has two advantages. First, the natural random metric associated with the symmetrized process is the $L^2(\mu_n)$-metric

\[
\|f - g\|_{L^2(\mu_n)} = \bigg[\frac{1}{n} \sum_{k=1}^n \{f(X_k) - g(X_k)\}^2\bigg]^{1/2},
\]

which is often easier to control than the metric $d_n$ defined above (while the latter is more precise, in most applications Lemma 7.4 suffices). Second, here we see that we can even control the supremum of the empirical process by the supremum of a true Gaussian process (conditionally on $X$), not just by a subgaussian process. This is conceptually pleasing, but does not make any difference in most applications: upper bounds using chaining work just as well for Gaussian processes as for subgaussian processes. We have used the Gaussian property much more heavily in deriving lower bounds; however, the Gaussian symmetrization is not necessarily sharp, so that we cannot derive lower bounds in this manner without further work (see, however, Problems 7.1 and 7.2 below for situations where one can implement this idea).

We conclude this section by noting that we can use symmetrization not only to bound the expected supremum of the empirical process, but also its tail probabilities. The following simple tool provides one way to do this.

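Lemma 7.6 (Panchenko). Let $X,Y$ be random variables such that

\[
\mathbf{E}[\Phi(X)] \le \mathbf{E}[\Phi(Y)]
\]

for every increasing convex function $\Phi$. If

\[
\mathbf{P}[Y \ge t] \le c_1 e^{-c_2 t^\alpha} \quad \text{for all } t \ge 0
\]

for some $c_1, \alpha \ge 1$ and $c_2 > 0$, then

\[
\mathbf{P}[X \ge t] \le c_1 e^{1 - c_2 t^\alpha} \quad \text{for all } t \ge 0.
\]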

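Proof. As $x \mapsto x_+^\alpha$ is increasing and convex for every $\alpha \ge 1$, it suffices to consider the case $\alpha = 1$. Applying the assumption to $\Phi(x) = (x - t)_+$ yields

\[
\int_t^\infty \mathbf{P}[X \ge s]\,ds \le \int_t^\infty \mathbf{P}[Y \ge s]\,ds \le \frac{c_1}{c_2}\, e^{-c_2 t} \quad \text{for all } t \ge 0.
\]

Thus we have

\[
\mathbf{P}[X \ge t] \le \frac{1}{a} \int_{t-a}^{t} \mathbf{P}[X \ge s]\,ds \le \frac{c_1}{c_2 a}\, e^{-c_2 (t-a)} \quad \text{for all } t \ge a.
\]

Choosing the optimal value $a = 1/c_2$ yields the result for $t \ge 1/c_2$, while the result holds trivially for $t < 1/c_2$ as then $c_1 e^{1 - c_2 t} \ge 1 \ge \mathbf{P}[X \ge t]$. □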
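
For completeness, here is a short worked computation of the optimal window length used at the end of the proof (in the case where the exponent is 1). The function $h$ is merely a name introduced here for the bound obtained after averaging over a window of length $a$; it does not appear in the statement of the lemma.

```latex
\[
h(a) := \frac{c_1}{c_2 a}\, e^{-c_2 (t - a)}, \qquad
\frac{d}{da} \log h(a) = c_2 - \frac{1}{a} = 0
\iff a = \frac{1}{c_2},
\]
\[
h\Big(\frac{1}{c_2}\Big) = \frac{c_1}{c_2 \cdot c_2^{-1}}\, e^{-c_2 t + 1} = c_1 e^{1 - c_2 t}.
\]
```

As $\log h$ is convex in $a$, this critical point is indeed the minimizer, and it is the source of the extra factor $e$ in the conclusion of the lemma.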

Using  this  lemma,  we  readily  obtain  the  following  symmetrization  bound. 

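Corollary 7.7 (Symmetrization tail bound). Suppose that

\[
\mathbf{P}\bigg[2 \sup_{f\in\mathcal{F}} \sum_{k=1}^n \varepsilon_k f(X_k) \ge K + t\bigg] \le c_1 e^{-c_2 t} \quad \text{for all } t \ge 0
\]

for some constants $c_1 \ge 1$ and $c_2, K > 0$. Then

\[
\mathbf{P}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \{f(X_k) - \mu f\} \ge K + t\bigg] \le c_1 e^{1 - c_2 t} \quad \text{for all } t \ge 0.
\]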


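Proof. The identical proof to Lemma 7.4 shows that

\[
\mathbf{E}\bigg[\Phi\bigg(\sup_{f\in\mathcal{F}} \sum_{k=1}^n \{f(X_k) - \mu f\}\bigg)\bigg]
\le \mathbf{E}\bigg[\Phi\bigg(2 \sup_{f\in\mathcal{F}} \sum_{k=1}^n \varepsilon_k f(X_k)\bigg)\bigg]
\]

for any increasing convex function $\Phi$. It remains to apply Lemma 7.6 (with $K$ subtracted from both random variables). □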


Corollary  7.7  can  now  be  used  in  conjunction  with  results  such  as  Theorem 
5.29  to  obtain  tail  bounds  for  the  empirical  process  in  terms  of  chaining. 


Problems 


7.1 (Rademacher and Gaussian processes). Let $T \subseteq \mathbb{R}^n$. In the proof of Lemma 7.4, we have seen that we can always bound

\[
r(T) := \mathbf{E}\bigg[\sup_{t\in T} \sum_{k=1}^n \varepsilon_k t_k\bigg] \lesssim \mathbf{E}\bigg[\sup_{t\in T} \sum_{k=1}^n g_k t_k\bigg] =: g(T),
\]

where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. symmetric Bernoulli and $g_1, \ldots, g_n$ are i.i.d. $N(0,1)$. Unfortunately, the converse inequality does not hold in general.

a. Show for $T = \{t \in \mathbb{R}^n : \|t\|_1 \le 1\}$ that $r(T) \sim 1$ and $g(T) \sim \sqrt{2 \log n}$.

b. Evidently $r(T)$ can be small for two distinct reasons: either because $g(T)$ is small, or because the $\ell_1$-diameter $\sup_{t\in T} \|t\|_1$ is small. Combine these as follows: if $T \subseteq T_1 + T_2$ with $g(T_1) \le a$ and $\sup_{t\in T_2} \|t\|_1 \le a$, then $r(T) \lesssim 2a$.

A deep result, conjectured by Talagrand and proved by Bednorz and Latala, shows that this idea captures completely the behavior of the Rademacher process: if $r(T) \le a$, then $T \subseteq T_1 + T_2$ for some $T_1, T_2$ such that $g(T_1) \le ca$ and $\sup_{t\in T_2} \|t\|_1 \le ca$. This result is proved by a very sophisticated form of the generic chaining method, and is beyond our scope.

In the example of part a., $r(T)$ and $g(T)$ are apart by a factor $\sim \sqrt{\log n}$. It turns out that this is the worst case situation: we always have

\[
r(T) \lesssim g(T) \lesssim r(T) \sqrt{\log n}.
\]

This could be deduced from Bednorz and Latala, but we give a direct proof.

c. Show that if $|a_1|, \ldots, |a_n| \le 1$, then

\[
\mathbf{E}\bigg[\sup_{t\in T} \sum_{k=1}^n \varepsilon_k t_k a_k\bigg] \le \mathbf{E}\bigg[\sup_{t\in T} \sum_{k=1}^n \varepsilon_k t_k\bigg].
\]

Hint: $(a_1, \ldots, a_n) \mapsto \mathbf{E}[\sup_{t\in T} \sum_{k=1}^n \varepsilon_k t_k a_k]$ is convex.

d. Conclude that $g(T) \lesssim r(T) \sqrt{\log n}$.

We have seen above that in general, the supremum of a Rademacher process and a Gaussian process can be far apart. However, in the context of the symmetrization Lemma 7.4, the situation should be much better than in the general case: here the supremum is taken over the random set $T = \{(f(X_1), \ldots, f(X_n)) : f \in \mathcal{F}\}$. Informally speaking, the typical magnitude of the $\ell_1$-norm of an element of this set is of order $n$, so we expect that $r(T)$ can be small only if $g(T)$ is small. Let us try to prove such a result.

e. Provided $\{\varepsilon_1, \ldots, \varepsilon_n\}$, $\{g_1, \ldots, g_n\}$, $\{X_1, \ldots, X_n\}$ are independent, show

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n g_k f(X_k)\bigg]
\le \int_0^\infty \mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^{|\{k \le n : |g_k| > x\}|} \varepsilon_k f(X_k)\bigg]\,dx.
\]

Hint: use $|g_k| = \int_0^\infty 1_{|g_k| > x}\,dx$.

f. Let $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ be a concave increasing function. Suppose that

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \varepsilon_k f(X_k)\bigg] \le \varphi(n) \quad \text{for all } n \ge 1.
\]

Show that

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n g_k f(X_k)\bigg] \le \int_0^\infty \varphi(n \mathbf{P}[|g_1| > x])\,dx \quad \text{for all } n \ge 1.
\]

In particular, if we choose $\varphi(n) = c n^\alpha$ for $\tfrac{1}{2} \le \alpha \le 1$, then we can control the Gaussian and Rademacher symmetrizations by the same rate.
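
The gap between the Rademacher and Gaussian quantities in part a. of Problem 7.1 is easy to observe by simulation. The following sketch is only illustrative: it uses the elementary fact that the supremum of a linear functional over the l1 unit ball equals the largest absolute value of its coefficients, and the dimension, number of repetitions, and seed are arbitrary assumptions.

```python
import math
import random

def sup_over_l1_ball(coeffs):
    """sup of sum_k c_k t_k over the l1 unit ball, which equals max_k |c_k|."""
    return max(abs(c) for c in coeffs)

def estimate_r_and_g(n, trials, rng):
    """Monte Carlo estimates of r(T) and g(T) for T the l1 unit ball in R^n."""
    r = sum(sup_over_l1_ball([rng.choice((-1, 1)) for _ in range(n)])
            for _ in range(trials)) / trials
    g = sum(sup_over_l1_ball([rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(trials)) / trials
    return r, g

rng = random.Random(2)  # fixed seed, an arbitrary choice
n = 1000
r_hat, g_hat = estimate_r_and_g(n, 300, rng)
print(r_hat, g_hat, math.sqrt(2 * math.log(n)))
```

The Rademacher estimate is exactly 1 here, while the Gaussian estimate is comparable to the square root of 2 log n, so the two differ by a factor of the order stated in the problem.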


7.2 (The Glivenko-Cantelli theorem). Let $X_1, X_2, \ldots$ be i.i.d. random variables with distribution $\mu$ on a measurable space $\mathbb{X}$, and let $\mathcal{F}$ be a class of functions on $\mathbb{X}$. For simplicity, we will assume throughout this problem that the class $\mathcal{F}$ is uniformly bounded (and, as we have implicitly assumed throughout these notes, that the suprema we will encounter are measurable). The class of functions $\mathcal{F}$ is said to be $\mu$-Glivenko-Cantelli if

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\frac{1}{n} \sum_{k=1}^n \{f(X_k) - \mu f\}\bigg|\bigg] \xrightarrow{n\to\infty} 0.
\]

Technically speaking, such a class is called weak Glivenko-Cantelli, as opposed to the strong Glivenko-Cantelli property that requires a.s. convergence.

a. Show that the weak Glivenko-Cantelli property implies the strong Glivenko-Cantelli property in the setting of this problem (of uniformly bounded $\mathcal{F}$).

Hint: use a suitable concentration inequality and Borel-Cantelli.

Symmetrization is a key tool to understand Glivenko-Cantelli classes. Let $\varepsilon_1, \varepsilon_2, \ldots$ and $g_1, g_2, \ldots$ be i.i.d. Rademacher and Gaussian variables as usual.

b. Show that $\mathcal{F}$ is Glivenko-Cantelli if and only if

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\frac{1}{n} \sum_{k=1}^n \varepsilon_k f(X_k)\bigg|\bigg] \xrightarrow{n\to\infty} 0.
\]

Hint: use $|\sum_{k=1}^n \varepsilon_k f(X_k)| \le |\sum_{k=1}^n \varepsilon_k \{f(X_k) - \mu f\}| + \|f\|_\infty |\sum_{k=1}^n \varepsilon_k|$.

c. In the previous problem we discussed a method to reverse the inequality between Rademacher and Gaussian symmetrization. In the present setting it will be useful to prove the following related inequality: for any $M > 0$

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\frac{1}{n} \sum_{k=1}^n g_k f(X_k)\bigg|\bigg]
\le \;\cdots\; + M\,\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\frac{1}{n} \sum_{k=1}^n \varepsilon_k f(X_k)\bigg|\bigg].
\]

Hint: insert $1 = 1_{|g_k| \le M} + 1_{|g_k| > M}$ inside the Gaussian symmetrization.

d. Show that $\mathcal{F}$ is Glivenko-Cantelli if and only if

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \bigg|\frac{1}{n} \sum_{k=1}^n g_k f(X_k)\bigg|\bigg] \xrightarrow{n\to\infty} 0.
\]

We are now ready to give a necessary and sufficient condition for the Glivenko-Cantelli property in terms of the random geometry of the set $\mathcal{F}$: we claim that $\mathcal{F}$ is Glivenko-Cantelli if and only if the following condition (*) holds:

\[
\frac{\log N(\mathcal{F}, \|\cdot\|_{L^2(\mu_n)}, \varepsilon)}{n} \xrightarrow{n\to\infty} 0 \quad \text{in probability for every } \varepsilon > 0.
\]

Here $\mu_n$ denotes the empirical measure of $X_1, \ldots, X_n$.

e. Show that condition (*) is sufficient for the Glivenko-Cantelli property.
Hint: use Lemma 5.7.

f. Show that condition (*) is necessary for the Glivenko-Cantelli property.
Hint: use Sudakov’s inequality.


7.3 (Self-normalized sums). Consider independent Gaussian random variables $X_1, \ldots, X_n$ with $\mathbf{E}[X_i] = 0$ and $\mathrm{Var}[X_i] = \sigma_i^2$. Obviously we have

\[
\mathbf{P}\bigg[\sum_{i=1}^n X_i \ge t \bigg(\sum_{i=1}^n \sigma_i^2\bigg)^{1/2}\bigg] \le e^{-t^2/2} \quad \text{for all } t \ge 0.
\]

Can one obtain similar inequalities when the variables $X_i$ are not Gaussian? By Azuma’s inequality (Lemma 3.7), we obtain the same result if $X_i$ is $\sigma_i^2$-subgaussian. However, for general random variables, there is no hope to obtain such an inequality. Indeed, if the variables $X_i$ have heavy tails, for example, then clearly the sum cannot have a Gaussian tail for large $t$.

Remarkably,  there  is  a  method  to  obtain  Gaussian  inequalities  of  this  type 
that  works  without  any  tail  assumption  on  the  random  variables!  The  key 
idea  is  to  choose  a  random  normalization  that  plays  the  role  of  the  sum  of  the 
variances  in  the  Gaussian  case.  We  then  say  the  sum  is  self-normalized. 

a. Consider first the simplest case of independent random variables $X_i$ that all have symmetric distributions. Show that

\[
\mathbf{P}\bigg[\sum_{i=1}^n X_i \ge t \bigg(\sum_{i=1}^n X_i^2\bigg)^{1/2}\bigg] \le e^{-t^2/2} \quad \text{for all } t \ge 0.
\]

Hint: apply Hoeffding’s inequality conditionally.

b. Prove the following consequence of Lemma 7.6: if $c_1 \ge 1$, $c_2 > 0$ are constants and $X, Y, Z$ are random variables such that $Y$ is nonnegative and

\[
\mathbf{P}[X \ge \sqrt{tY}] \le c_1 e^{-c_2 t} \quad \text{for all } t \ge 0,
\]

then

\[
\mathbf{P}\big[\mathbf{E}[X|Z] \ge \sqrt{t\,\mathbf{E}[Y|Z]}\big] \le c_1 e^{1 - c_2 t} \quad \text{for all } t \ge 0.
\]

Hint: use $\sqrt{tY} = \inf_{a>0} \{t/2a + aY/2\}$.

c. Let $X_1, \ldots, X_n$ be any independent random variables with $\mathbf{E}[X_i] = 0$ and $\mathbf{E}[X_i^2] = \sigma_i^2$. Prove the following self-normalized inequality:

\[
\mathbf{P}\bigg[\sum_{i=1}^n X_i \ge t \bigg(\sum_{i=1}^n (X_i^2 + \sigma_i^2)\bigg)^{1/2}\bigg] \le e^{1 - t^2/2} \quad \text{for all } t \ge 0.
\]
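
To see that the self-normalized bound of part a. really requires no tail assumption, one can simulate it for variables without even a finite mean. The sketch below is only an illustration: the choice of standard Cauchy variables, the values of n and t, the number of repetitions, and the seed are all assumptions made for this example.

```python
import math
import random

def self_normalized_ratio(xs):
    """sum_i X_i divided by (sum_i X_i^2)^{1/2}."""
    return sum(xs) / math.sqrt(sum(x * x for x in xs))

def cauchy(rng):
    # Standard Cauchy via the inverse distribution function:
    # a symmetric distribution with no finite mean.
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(3)  # fixed seed, an arbitrary choice
n, trials = 50, 20_000
ratios = [self_normalized_ratio([cauchy(rng) for _ in range(n)])
          for _ in range(trials)]

for t in (1.0, 2.0, 3.0):
    empirical = sum(r >= t for r in ratios) / trials
    # Empirical tail frequency next to the Gaussian-type bound exp(-t^2 / 2).
    print(t, empirical, math.exp(-t * t / 2))
```

In this simulation the empirical frequencies stay below the bound even though the sum itself has a heavy tail.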

7.4 (The contraction principle). Let $g_1, \ldots, g_n$ be i.i.d. $N(0,1)$. Consider

\[
\mathbf{E}\bigg[\sup_{t\in T} \sum_{i=1}^n g_i t_i\bigg]
\]

for a subset $T \subseteq \mathbb{R}^n$. In the best case $T = \{-t, t\}$, the magnitude of this quantity is of order $\sqrt{n}$. We informally view this rate as arising from cancellation of terms in the sum with opposite signs. When the set $T$ is “large,” however, this quantity can be much larger than $\sqrt{n}$ as the supremum can cancel some of the signs. For example, in the extreme case that $T = \{-1, 1\}^n$, we can cancel the signs exactly and the above quantity is of order $n$.

The above discussion suggests that a class $T$ with “less variability” should lead to a smaller Gaussian supremum. One simple result along these lines is

\[
\mathbf{E}\bigg[\sup_{t\in T} \sum_{i=1}^n g_i |t_i|\bigg] \le \mathbf{E}\bigg[\sup_{t\in T} \sum_{i=1}^n g_i t_i\bigg].
\]

This statement is an easy consequence of Slepian’s inequality.

a. Prove the above bound.

We now turn our attention to the Rademacher process

\[
\mathbf{E}\bigg[\sup_{t\in T} \sum_{i=1}^n \varepsilon_i t_i\bigg].
\]

Is there an analogue for the Rademacher process of the property proved in part a.? It is not immediately clear how to proceed, as there is no Slepian inequality for Rademacher processes (in fact, the absence of such an inequality presents a major challenge in the deeper understanding of such processes!). However, there is a less powerful comparison inequality for Rademacher processes, called the contraction principle, that can sometimes play an analogous role to Slepian’s inequality in this setting. We develop it presently.

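b. Let $T$ be a bounded subset of $\mathbb{R}^2$, and let $\varphi : \mathbb{R} \to \mathbb{R}$ be 1-Lipschitz. Prove

\[
\sup_{t\in T} \{t_1 + \varphi(t_2)\} + \sup_{t\in T} \{t_1 - \varphi(t_2)\} \le \sup_{t\in T} \{t_1 + t_2\} + \sup_{t\in T} \{t_1 - t_2\}.
\]

c. Let $\varphi_i : \mathbb{R} \to \mathbb{R}$ be 1-Lipschitz for $i \le n$. Prove the contraction principle

\[
\mathbf{E}\bigg[\sup_{t\in T} \sum_{i=1}^n \varepsilon_i \varphi_i(t_i)\bigg] \le \mathbf{E}\bigg[\sup_{t\in T} \sum_{i=1}^n \varepsilon_i t_i\bigg].
\]

Hint: apply the previous part conditionally on $\varepsilon_1, \ldots, \varepsilon_{i-1}, \varepsilon_{i+1}, \ldots, \varepsilon_n$.

d. Deduce the Rademacher analogue of the above Gaussian inequality:

\[
\mathbf{E}\bigg[\sup_{t\in T} \sum_{i=1}^n \varepsilon_i |t_i|\bigg] \le \mathbf{E}\bigg[\sup_{t\in T} \sum_{i=1}^n \varepsilon_i t_i\bigg].
\]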

e. Let $\mathcal{F}$ be a uniformly bounded class of functions with $\|f\|_\infty \le M$ for all $f \in \mathcal{F}$. In various applications, it proves to be important to control the empirical process over the family of squares $f^2$. Show that

\[
\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \{f(X_k)^2 - \mu f^2\}\bigg] \le 4M\,\mathbf{E}\bigg[\sup_{f\in\mathcal{F}} \sum_{k=1}^n \varepsilon_k f(X_k)\bigg],
\]

so that it is possible to control the empirical process using the covering numbers of $\mathcal{F}$ itself (rather than the covering numbers of $\mathcal{F}^2 = \{f^2 : f \in \mathcal{F}\}$ that would arise from a direct application of symmetrization).

Let  us  note  that  with  a  bit  more  work,  we  can  also  deduce  a  version  of  the 
contraction  principle  that  makes  it  possible  to  obtain  tail  bounds  by  including 
a  convex  function  as  we  did  for  symmetrization  in  the  proof  of  Corollary  7.7. 
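
For a finite set T in low dimension, both sides of the contraction principle of Problem 7.4 can be computed exactly by enumerating all sign patterns, which gives a convenient sanity check. The sketch below is only illustrative: the set T and the two 1-Lipschitz maps (the absolute value and truncation at 1) are arbitrary choices that do not come from the text.

```python
from itertools import product

def rademacher_sup(T):
    """E sup_{t in T} sum_i eps_i t_i, computed exactly over all sign vectors."""
    n = len(T[0])
    signs = list(product((-1, 1), repeat=n))
    return sum(max(sum(e * ti for e, ti in zip(eps, t)) for t in T)
               for eps in signs) / len(signs)

# A small hypothetical set T in R^4 (any finite set would do).
T = [(1.0, -2.0, 0.5, 3.0), (-1.5, 1.0, 2.0, -0.5), (0.0, -3.0, -1.0, 1.0)]

# phi_i(x) = |x| is 1-Lipschitz: this is the situation of part d.
T_abs = [tuple(abs(ti) for ti in t) for t in T]
# Another 1-Lipschitz choice: phi_i(x) = min(x, 1).
T_clip = [tuple(min(ti, 1.0) for ti in t) for t in T]

base = rademacher_sup(T)
print(rademacher_sup(T_abs), rademacher_sup(T_clip), base)
```

Both transformed quantities should be no larger than the last printed number, as the contraction principle predicts.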
In the previous section, we saw that we can bound using symmetrization

\[
\mathbf{E}\Big[\sup_{f\in\mathcal{F}} G_n(f)\Big] \lesssim \mathbf{E}\bigg[\int_0^\infty \sqrt{\log N(\mathcal{F}, \|\cdot\|_{L^2(\mu_n)}, \varepsilon)}\,d\varepsilon\bigg].
\]


This is a vast improvement over the result that we would have obtained by chaining directly using the Azuma-Hoeffding inequality, in which case the covering number would be replaced by the much larger quantity $N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$. The difficulty in applying the above bound, however, is that we must control the random covering numbers $N(\mathcal{F}, \|\cdot\|_{L^2(\mu_n)}, \varepsilon)$. Unfortunately, it is often difficult to obtain bounds that exploit the specific structure of the random geometry of $(\mathcal{F}, L^2(\mu_n))$. We therefore concentrate on the intermediate problem of controlling the random covering numbers uniformly:

\[
N(\mathcal{F}, \|\cdot\|_{L^2(\mu_n)}, \varepsilon) \le \|N(\mathcal{F}, \|\cdot\|_{L^2(\mu_n)}, \varepsilon)\|_\infty \le N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon).
\]

At first sight, one might expect that uniform control of the random covering numbers would essentially reduce to covering in the uniform norm, as all the structure of the original distribution $\mu$ is lost. Surprisingly, this intuition proves to be incorrect: in many cases, the combinatorial structure of the class $\mathcal{F}$ makes it possible to control its uniform covering numbers very effectively, while covering in the uniform norm leads to useless bounds. We have seen in Example 7.2 that the latter difficulty already arises in an extreme manner for classes of indicator functions. We therefore begin in this section by investigating this situation: that is, we will assume that $\mathcal{F} = \{1_C : C \in \mathcal{C}\}$ for a class of sets $\mathcal{C}$. Such problems are of significant interest in their own right in many applications, and also serve to illustrate the ideas that we are about to develop in the simplest possible setting. In the following section, we will extend these ideas to general classes of functions.

As we will be working exclusively with sets in this section, we will simplify
our notation by implicitly identifying sets and their indicator functions; in
particular, we denote by $(\mathcal{C},\|\cdot\|)$ the class of sets
$\mathcal{C}$ with the metric $\|\mathbf{1}_C-\mathbf{1}_{C'}\|$.
Let us begin by recalling the difficulty with using the uniform norm: clearly
$\|\mathbf{1}_C-\mathbf{1}_{C'}\|_\infty = 1$ whenever $C\ne C'$, so a moment's
reflection will show that

\[
N(\mathcal{C},\|\cdot\|_\infty,\varepsilon) = |\mathcal{C}|
\quad\text{for } \varepsilon<1.
\]

As $|\mathcal{C}|=\infty$ in most cases of interest, this is useless. How can
symmetrization beat this limitation? In fact, symmetrization can help us in two
distinct ways:

1. The symmetrized bound requires covering only in $L^2$ rather than $L^\infty$.

2. The symmetrized bound involves only norms supported on the finite set
$\operatorname{supp}\mu_n = \{X_1,\dots,X_n\}$ rather than the entire space
$\mathbb{X}$.
The combination of these two ideas will lead to a powerful machinery to
control the covering numbers in the symmetrization bound. In order to gain
insight into the roles played by each of these ideas, we will begin by
disregarding the first point completely and see how far we can get by exploiting
only  the  reduction  in  complexity  provided  by  the  second  point.  Once  this  idea 
has  been  understood,  we  will  return  to  the  first  point  and  show  how  it  can  be 
exploited  to  further  reduce  the  complexity  of  the  problem. 

In order to exploit the reduction of the underlying space to a finite set, let
us bound the random covering numbers in the most naive manner possible:

\[
N(\mathcal{C},\|\cdot\|_{L^2(\mu_n)},\varepsilon) \le
N(\mathcal{C},\|\cdot\|_{L^\infty(\mu_n)},\varepsilon) =
|\mathcal{C}\cap\{X_1,\dots,X_n\}|.
\]

As $\mathcal{C}\cap\{X_1,\dots,X_n\}$ consists of subsets of at most $n$ points,
the above quantity is bounded by at most $2^n$. Thus this naive bound already
improves over covering in the uniform norm on the entire space $\mathbb{X}$!
Unfortunately, bounding the covering number by $2^n$ does not lead to any
nontrivial result. Indeed, as the diameter of the set $\mathcal{C}$ is bounded
by one, we can estimate


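\[
\mathbf{E}\Big[\sup_{C\in\mathcal{C}} G_n(C)\Big] \lesssim
\mathbf{E}\Big[\sqrt{\log|\mathcal{C}\cap\{X_1,\dots,X_n\}|}\Big] \le \sqrt{n},
\]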


which we could have seen immediately from the definition of the empirical
process (as $\|G_n(C)\|_\infty\le\sqrt{n}$). Of course, we cannot expect
anything better at this level of generality: if $\mathcal{C}$ is the class of
all (measurable) subsets of $\mathbb{X}$, then clearly
$\sup_{C\in\mathcal{C}} G_n(C)=\sqrt{n}$ for any nonatomic measure $\mu$. In
order to obtain a nontrivial result, we must exploit the structure of the class
$\mathcal{C}$. Remarkably, it turns out that in many cases the quantity
$|\mathcal{C}\cap\{X_1,\dots,X_n\}|$ is much smaller than $2^n$. Before we
attempt to understand this phenomenon in a general setting, let us develop some
intuition in two illuminating examples.

Example 7.8 (The empirical distribution function). Let us revisit the setting
of Example 7.2 where $\mathbb{X}=\mathbb{R}$ and
$\mathcal{C}=\{\,]-\infty,x] : x\in\mathbb{R}\}$. Clearly

\[
\mathcal{C}\cap\{X_1,\dots,X_n\} =
\{\{X_{(n)},\dots,X_{(k)}\} : k=1,\dots,n\}\cup\{\varnothing\},
\]

where $X_{(1)}\ge\cdots\ge X_{(n)}$ is the decreasing rearrangement of
$X_1,\dots,X_n$. Thus we have shown in this case that
$|\mathcal{C}\cap\{X_1,\dots,X_n\}|\le n+1$, which is much smaller than $2^n$!
In particular, this implies the nontrivial result


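\[
\mathbf{E}\|F_n-F\|_\infty =
\frac{1}{\sqrt{n}}\,\mathbf{E}\Big[\sup_{C\in\mathcal{C}}|G_n(C)|\Big] \lesssim
\sqrt{\frac{\log(n+1)}{n}}.
\]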


It turns out that the rate that we obtained here is not optimal: we lost
a logarithmic factor when we bounded the $L^2(\mu_n)$-covering number by the
$L^\infty(\mu_n)$-covering number. This inefficiency will be addressed later in
this section. Nonetheless, the simple argument given here already suffices to
prove the classical Glivenko-Cantelli theorem discussed in Example 7.2 (it is
left as an exercise to deduce a.s. convergence from convergence of the mean
using McDiarmid's inequality and the Borel-Cantelli lemma).
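The count in Example 7.8 is easy to check numerically. The following minimal Python sketch enumerates the traces of half-lines on a sample; the function name `halfline_traces`, the Gaussian sample, and the seed are illustrative assumptions, not part of the notes.

```python
import random

def halfline_traces(points):
    """Distinct sets {p in points : p <= x} obtained as x ranges over the reals."""
    pts = sorted(set(points))
    traces = {frozenset()}            # threshold x below every sample point
    for x in pts:                     # thresholds at the sample points suffice
        traces.add(frozenset(p for p in pts if p <= x))
    return traces

random.seed(0)                        # arbitrary seed, for reproducibility only
sample = [random.gauss(0.0, 1.0) for _ in range(12)]
n_traces = len(halfline_traces(sample))
print(n_traces, "distinct traces out of", 2 ** len(sample), "subsets")
```

The number of traces never exceeds n + 1, regardless of the distribution from which the sample is drawn.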
Example 7.9 (Rectangles). Let $\mathbb{X}=\mathbb{R}^2$ and let

\[
\mathcal{C} = \{[a,b]\times[c,d] : a<b,\ c<d\}
\]

be the class of axis-parallel rectangles. We claim that in this case

\[
|\mathcal{C}\cap\{X_1,\dots,X_n\}| \le n^4.
\]

To see why this is the case, let us use a simple counting argument. Fix a
configuration of points $X_1,\dots,X_n$. To every rectangle $C\in\mathcal{C}$,
we can associate uniquely another rectangle $C'$ that is the smallest rectangle
such that $C\cap\{X_1,\dots,X_n\} = C'\cap\{X_1,\dots,X_n\}$. This is
illustrated in the following figure:

[Figure: a rectangle $C$ together with its minimal rectangle $C'$.]


Note that $|\mathcal{C}\cap\{X_1,\dots,X_n\}|$ is equal to the number of minimal
rectangles $C'$. Each $C'$ can be represented by specifying four points in
$\{X_1,\dots,X_n\}$, one for each edge. Thus there are at most $n^4$ such
possibilities. (To be precise, we must account separately for the case
$C=\varnothing$; however, as not every 4-tuple of points defines a valid
rectangle, the crude upper bound $n^4$ is still valid.)

In  view  of  this  simple  estimate,  we  can  now  bound  the  supremum  of  the 
empirical  process  over  rectangles  precisely  as  in  the  previous  example. 
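The counting argument for rectangles can also be tried out by brute force. The sketch below relies on the observation made above that every trace is realized by a minimal rectangle whose edges pass through sample coordinates; the helper name `rectangle_traces` and the uniform sample are assumptions made for illustration.

```python
import itertools
import random

def rectangle_traces(points):
    """Distinct traces of axis-parallel rectangles on a finite list of points."""
    xs = sorted({p[0] for p in points})
    ys = sorted({p[1] for p in points})
    traces = {frozenset()}            # a rectangle that misses every point
    # Rectangles with edges at sample coordinates realize every nonempty trace.
    for a, b in itertools.combinations_with_replacement(xs, 2):
        for c, d in itertools.combinations_with_replacement(ys, 2):
            traces.add(frozenset(
                p for p in points if a <= p[0] <= b and c <= p[1] <= d))
    return traces

random.seed(1)                        # arbitrary seed, for reproducibility only
sample = [(random.random(), random.random()) for _ in range(8)]
n_traces = len(rectangle_traces(sample))
print(n_traces, "traces; n^4 =", len(sample) ** 4, "; 2^n =", 2 ** len(sample))
```

Degenerate rectangles (with a = b or c = d) appear in the enumeration; since the sample is finite, each can be enlarged slightly to a genuine rectangle with the same trace.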

It appears in these examples that the quantity
$|\mathcal{C}\cap\{X_1,\dots,X_n\}|$ somehow captures the number of degrees of
freedom of the class $\mathcal{C}$. In the first example there was only one
parameter $x\in\mathbb{R}$, and the number of sets was $\sim n$. In the
second example there were four parameters $a,b,c,d\in\mathbb{R}$, and the number
of sets was $\sim n^4$. This is not a coincidence: it is typically the case that
a class of sets $\mathcal{C}$ of "dimension" $d$ satisfies
$|\mathcal{C}\cap\{X_1,\dots,X_n\}|\sim n^d$. To understand this
phenomenon for general classes of sets, we must understand how to define an
intrinsic notion of "dimension" that does not depend on a parametrization.
To this end, we introduce a combinatorial notion of dimension.

Definition 7.10 (Shattering). A set $I\subseteq\mathbb{X}$ is said to be
shattered by $\mathcal{C}$ if $\mathcal{C}\cap I = 2^I$, that is, if for every
$J\subseteq I$, there exists $C\in\mathcal{C}$ such that $C\cap I = J$.

Definition 7.11 (VC-dimension). The Vapnik-Chervonenkis dimension or
VC-dimension of $\mathcal{C}$ is defined as
$\mathrm{vc}(\mathcal{C}) := \sup\{|I| : I \text{ is shattered by } \mathcal{C}\}$.

In words, $\mathrm{vc}(\mathcal{C})$ is the cardinality of the largest set of
points so that we can recover all possible subsets of these points by
intersecting with elements of $\mathcal{C}$. Another way to view the
VC-dimension $\mathrm{vc}(\mathcal{C})$ is as the largest integer
$n$ such that $|\mathcal{C}\cap\{x_1,\dots,x_n\}| = 2^n$ for some set of points
$x_1,\dots,x_n\in\mathbb{X}$. If $\mathrm{vc}(\mathcal{C})=\infty$, then it is
quite possible that $|\mathcal{C}\cap\{X_1,\dots,X_n\}|\sim 2^n$ for all $n$,
and there is nothing nontrivial to be gained from the present approach (at least
without exploiting specific properties of the random samples $X_1,\dots,X_n$).
It is not at all obvious at this point, however, that we are any better off in
the situation where $\mathrm{vc}(\mathcal{C})<\infty$: even if
$|\mathcal{C}\cap\{x_1,\dots,x_n\}| < 2^n$ for all points $x_1,\dots,x_n$, what
is preventing us from having, say,
$|\mathcal{C}\cap\{x_1,\dots,x_n\}|\ge 2^{n/2}$? It is a remarkable
combinatorial fact that this situation cannot occur: a class of sets with
$\mathrm{vc}(\mathcal{C})=d$ always satisfies
$|\mathcal{C}\cap\{x_1,\dots,x_n\}|\lesssim n^d$.
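For a finite class on a finite ground set, Definitions 7.10 and 7.11 can be evaluated directly by exhaustive search. The following Python sketch does this; the function names and the two small test classes (initial segments and intervals of integers) are illustrative assumptions, and the search is exponential in the size of the ground set.

```python
from itertools import combinations

def is_shattered(I, sets):
    """True if every subset of I arises as I & C for some C in sets."""
    I = frozenset(I)
    return len({I & C for C in sets}) == 2 ** len(I)

def vc_dimension(ground, sets):
    """Brute-force VC-dimension of a nonempty finite class of subsets of ground."""
    sets = [frozenset(C) for C in sets]
    best = 0
    for k in range(1, len(ground) + 1):
        if any(is_shattered(I, sets) for I in combinations(ground, k)):
            best = k
        else:
            break   # subsets of shattered sets are shattered, so we can stop
    return best

ground = list(range(6))
initial_segments = [set(range(k)) for k in range(7)]              # {0,...,k-1}
intervals = [set(range(i, j)) for i in range(7) for j in range(i, 7)]
print(vc_dimension(ground, initial_segments), vc_dimension(ground, intervals))
```

Initial segments behave like the half-lines of Example 7.8, while intervals have two free endpoints, which is reflected in the two computed values.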

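Lemma 7.12 (Sauer-Shelah). For all $n\ge 1$ and $x_1,\dots,x_n\in\mathbb{X}$

\[
|\mathcal{C}\cap\{x_1,\dots,x_n\}| \le
\sum_{k=0}^{\mathrm{vc}(\mathcal{C})}\binom{n}{k} \le
\Big(\frac{en}{\mathrm{vc}(\mathcal{C})}\Big)^{\mathrm{vc}(\mathcal{C})},
\]

where the second inequality holds for $n\ge\mathrm{vc}(\mathcal{C})$.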


The proof of Lemma 7.12 is an exercise in combinatorics: we must find
an effective way to count the subsets $|\mathcal{C}\cap\{x_1,\dots,x_n\}|$. We
will postpone the
proof  of  this  result  until  the  end  of  this  section,  so  that  we  can  focus  our 
attention  on  its  implications  for  the  control  of  empirical  processes.  Before  we 
continue  down  this  road,  however,  it  is  instructive  to  verify  the  validity  of  the 
Sauer-Shelah  lemma  in  the  two  examples  discussed  above. 

Example 7.13 (The empirical distribution function). In the setting of Example
7.8, it is easily seen that $\mathrm{vc}(\mathcal{C})=1$. Indeed, clearly any
singleton $\{z\}$ is shattered, as $\{z\}\cap\,]-\infty,z-1]=\varnothing$ and
$\{z\}\cap\,]-\infty,z]=\{z\}$. On the other hand, no set of two points
$\{z_1,z_2\}$ is shattered: after all, if $z_1<z_2$, then the
set $\{z_2\}$ cannot be recovered by intersecting with any set in $\mathcal{C}$.

Example 7.14 (Rectangles). In the setting of Example 7.9, we claim that
$\mathrm{vc}(\mathcal{C})=4$. It is easy to construct a set of four points that
is shattered (for example, $\{(0,1),(0,-1),(1,0),(-1,0)\}$). On the other hand,
choose any set $I$ of
five  points,  and  let  C  be  the  smallest  rectangle  enclosing  I.  Then  at  least  four 
points  in  I  touch  the  boundary  of  C.  But  any  rectangle  that  contains  these 
four  points  must  necessarily  also  contain  the  fifth,  so  I  cannot  be  shattered. 
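The argument of Example 7.14 translates into a simple test: a subset J of I is the trace of some rectangle if and only if the bounding box of J contains no other point of I. The sketch below uses this criterion; the function names and the 3-by-3 grid used for the five-point check are illustrative assumptions.

```python
from itertools import combinations

def trace_of_bounding_box(J, I):
    """Points of I lying in the smallest rectangle that contains J (J nonempty)."""
    x_lo, x_hi = min(p[0] for p in J), max(p[0] for p in J)
    y_lo, y_hi = min(p[1] for p in J), max(p[1] for p in J)
    return {p for p in I if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi}

def shattered_by_rectangles(I):
    """True if every subset of the finite point set I is cut out by a rectangle."""
    I = list(I)
    for k in range(1, len(I) + 1):
        for J in combinations(I, k):
            if trace_of_bounding_box(J, I) != set(J):
                return False
    return True   # the empty subset is cut out by a rectangle far away from I

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
grid = [(i, j) for i in range(3) for j in range(3)]
print(shattered_by_rectangles(diamond))
print(any(shattered_by_rectangles(I) for I in combinations(grid, 5)))
```

A degenerate bounding box can always be enlarged slightly without capturing further points of the finite set I, so the restriction to rectangles with a < b and c < d is harmless here.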

As  can  be  seen  in  these  examples,  the  VC-dimension  of  a  class  of  sets 
is  often  easy  to  compute.  The  beauty  of  this  notion  is  that  shattered  sets, 
which  are  “witnesses”  to  high-dimensional  behavior,  are  very  rigid  objects, 
and  it  is  therefore  often  straightforward  to  rule  out  their  existence  in  specific 
situations.  The  combinatorial  principle  expressed  by  the  Sauer-Shelah  lemma 
is  consequently  a  powerful  tool  not  just  in  theory  but  also  in  practice. 

Let  us  now  return  to  the  control  of  the  empirical  process.  By  combining  the 
Sauer-Shelah  lemma  with  our  symmetrization  bound,  we  immediately  obtain 

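\[
\sup_\mu \mathbf{E}\Big[\sup_{C\in\mathcal{C}}\{\mu_n(C)-\mu(C)\}\Big] \lesssim
\sqrt{\frac{\mathrm{vc}(\mathcal{C})\log n}{n}},
\]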
where the supremum is taken over all probability measures $\mu$ on $\mathbb{X}$.
This result shows not only that the law of large numbers holds uniformly over
classes of sets $\mathcal{C}$ with finite VC-dimension (a far-reaching
generalization of the original result of Glivenko and Cantelli discussed in
Example 7.2), but we even obtain a bound on the rate of convergence that is
completely independent of the distribution of the underlying independent
variables! Classes $\mathcal{C}$ that satisfy this property are called uniform
Glivenko-Cantelli classes.

Remark 7.15. The independence of our bounds of the distribution $\mu$ can be
both a positive and negative feature. In applications in statistics or machine
learning, where only independent samples are available and the underlying
distribution $\mu$ is unknown, distribution-free estimates make it possible to
evaluate the error of statistical procedures without making any assumptions on
the data-generating mechanism. On the other hand, it is certainly possible
for a class $\mathcal{C}$ to satisfy the $\mu$-Glivenko-Cantelli property for
some distributions $\mu$ and not for others, and the VC-dimension cannot capture
this behavior. In such situations, we cannot ignore the law of the samples
$X_1,\dots,X_n$: the random geometry must be genuinely understood to obtain
nontrivial results. We will encounter an example in which this can be done in
Problem 7.10.

Although we have obtained a decidedly nontrivial result from a direct
application of the Sauer-Shelah lemma, it turns out that this result is not
sharp: the optimal rate in the uniform law of large numbers for classes of finite
VC-dimension is in fact the usual $1/\sqrt{n}$ central limit theorem rate! Thus
we have apparently picked up an extra factor of order $\sqrt{\log n}$. The
origin of this inefficiency does not lie in the Sauer-Shelah lemma: our
combinatorial bound

\[
N(\mathcal{C},\|\cdot\|_{L^\infty(\mu_n)},\varepsilon) \lesssim
n^{\mathrm{vc}(\mathcal{C})}
\]

is sharp, as can be seen in Examples 7.8 and 7.9. The problem lies in the very
first step of our analysis, where we applied the crude estimate

\[
N(\mathcal{C},\|\cdot\|_{L^2(\mu_n)},\varepsilon) \le
N(\mathcal{C},\|\cdot\|_{L^\infty(\mu_n)},\varepsilon).
\]

The $L^2$-covering numbers prove to be much smaller than the $L^\infty$-covering
numbers: while the latter must necessarily grow with $n$, the former do not
depend on $n$ at all! In fact, it turns out that the space
$(\mathcal{C},\|\cdot\|_{L^2(\mu)})$ has metric dimension
$\propto\mathrm{vc}(\mathcal{C})$, uniformly over all probability measures $\mu$.

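Theorem 7.16 (Dudley). There is a universal constant $K$ such that

\[
\sup_\mu N(\mathcal{C},\|\cdot\|_{L^2(\mu)},\varepsilon) \le
\Big(\frac{K}{\varepsilon}\Big)^{K\,\mathrm{vc}(\mathcal{C})}
\quad\text{for all } \varepsilon<1.
\]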

Where did the dependence on $n$ disappear to? The idea is surprisingly
simple. Suppose that $\{C_1,\dots,C_m\}$ is a maximal $\varepsilon$-packing of
$(\mathcal{C},\|\cdot\|_{L^2(\mu)})$, that is,
$\|\mathbf{1}_{C_i}-\mathbf{1}_{C_j}\|_{L^2(\mu)}>\varepsilon$ for all $i\ne j$.
If we draw $r$ random samples from $\mu$, then the law of large numbers ensures
that we have

\[
\|\mathbf{1}_{C_i}-\mathbf{1}_{C_j}\|_{L^2(\mu_r)} \approx
\|\mathbf{1}_{C_i}-\mathbf{1}_{C_j}\|_{L^2(\mu)}.
\]

Thus if we choose $r$ large enough, then we can ensure that $\{C_1,\dots,C_m\}$
is still an $\varepsilon/2$-packing of $(\mathcal{C},\|\cdot\|_{L^2(\mu_r)})$,
and in this case we obtain

\[
N(\mathcal{C},\|\cdot\|_{L^2(\mu)},\varepsilon) \le
N(\mathcal{C},\|\cdot\|_{L^2(\mu_r)},\varepsilon/4) \le
N(\mathcal{C},\|\cdot\|_{L^\infty(\mu_r)},\varepsilon/4) \lesssim
r^{\mathrm{vc}(\mathcal{C})}.
\]

The  key  insight  is  now  that  the  number  of  samples  r  that  we  need  to  draw  in 
order  to  ensure  that  this  estimate  holds  depends  only  on  e  and  m — the  original 
sample  size  n  of  the  empirical  process  is  completely  irrelevant!  In  particular, 
just as we previously exploited the fact that symmetrization reduces the space
$\mathbb{X}$ to a finite set $\{X_1,\dots,X_n\}$ of cardinality $n$, we now
reduce the size of the space even further by throwing out those points that are
not needed to maintain the separation between the sets $C_i$. The gain obtained
from this reduction accounts precisely for the improvement in Theorem 7.16. This
idea, called probabilistic extraction, is made precise by the following lemma.
For
future  reference,  we  formulate  it  for  general  functions  rather  than  sets  (see 
Problem  7.6  for  a  somewhat  sharper  bound  that  is  specific  to  sets). 

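Lemma 7.17 (Extraction). Let $f_1,\dots,f_m$ be functions on $\mathbb{X}$ such
that

\[
\|f_i\|_\infty\le 1,\qquad \|f_i-f_j\|_{L^2(\mu)}>\varepsilon
\quad\text{for all } 1\le i<j\le m.
\]

Then there exist $r\le c\varepsilon^{-4}\log m$ points
$x_1,\dots,x_r\in\mathbb{X}$ such that

\[
\|f_i-f_j\|_{L^2(\mu_x)}>\varepsilon/2 \quad\text{for all } 1\le i<j\le m,
\]

where $\mu_x := \frac{1}{r}\sum_{k=1}^r\delta_{x_k}$ and $c$ is a universal
constant.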

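Proof. Let $X_1,\dots,X_r\sim\mu$ be i.i.d., and denote by $\mu_r$ their
empirical measure. Then we can estimate using the Azuma-Hoeffding inequality

\[
\mathbf{P}\big[\|f_i-f_j\|_{L^2(\mu_r)}\le\varepsilon/2\big] \le
e^{-r\varepsilon^4/15}
\]

for every $i\ne j$. A union bound now gives

\[
\mathbf{P}\big[\|f_i-f_j\|_{L^2(\mu_r)}>\varepsilon/2 \text{ for all } i\ne j\big]
\ge 1-m^2e^{-r\varepsilon^4/15} > 0
\]

for $r>30\varepsilon^{-4}\log m$, and the result follows readily. □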
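To get a feeling for the sample size in Lemma 7.17, one can evaluate the union bound from the proof numerically. The sketch below is only an arithmetic illustration of the two displayed estimates; the function names and the choice of eps = 0.5, m = 100 are hypothetical.

```python
import math

def extraction_sample_size(eps, m):
    """Smallest integer r with r > 30 * eps**(-4) * log(m), as in the proof."""
    return math.floor(30.0 * math.log(m) / eps ** 4) + 1

def failure_bound(eps, m, r):
    """Union bound m^2 * exp(-r * eps^4 / 15) on the probability of a collapse."""
    return m ** 2 * math.exp(-r * eps ** 4 / 15.0)

eps, m = 0.5, 100
r = extraction_sample_size(eps, m)
print(r, failure_bound(eps, m, r))
```

Note that r depends only on eps and m; no sample size n of an empirical process enters the computation.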


We  can  now  easily  complete  the  proof  of  Theorem  7.16. 
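Proof (Theorem 7.16). Let $\mu$ be any probability on $\mathbb{X}$, and let
$C_1,\dots,C_m$ be a maximal $\varepsilon$-packing of
$(\mathcal{C},\|\cdot\|_{L^2(\mu)})$. By Lemma 7.17, there exist
$r\le c\varepsilon^{-4}\log m$ points $x_1,\dots,x_r$ so that $C_1,\dots,C_m$ is
still a packing of $(\mathcal{C},\|\cdot\|_{L^2(\mu_x)})$. Thus

\[
m \le |\mathcal{C}\cap\{x_1,\dots,x_r\}| \le
\Big(\frac{er}{\mathrm{vc}(\mathcal{C})}\Big)^{\mathrm{vc}(\mathcal{C})} \le
\Big(\frac{\log m}{\mathrm{vc}(\mathcal{C})}\Big)^{\mathrm{vc}(\mathcal{C})}
\Big(\frac{(ec)^{1/4}}{\varepsilon}\Big)^{4\,\mathrm{vc}(\mathcal{C})}
\]

by the Sauer-Shelah lemma. But using $\alpha\log m\le m^\alpha$ with
$\alpha = 1/(2\,\mathrm{vc}(\mathcal{C}))$, we obtain

\[
m \le \Big(\frac{(2ec)^{1/4}}{\varepsilon}\Big)^{8\,\mathrm{vc}(\mathcal{C})},
\]

and the proof is complete as
$m\ge N(\mathcal{C},\|\cdot\|_{L^2(\mu)},\varepsilon)$ by Lemma 5.12. □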

With the proof of Theorem 7.16 being complete, we have now accomplished
what we set out to do at the beginning of this section: we obtained
uniform control over the $L^2$-covering numbers of a class of sets $\mathcal{C}$
in terms of
its  combinatorial  structure.  In  particular,  we  can  now  obtain  the  optimal  rate 
in  the  uniform  law  of  large  numbers  for  classes  of  finite  VC-dimension. 

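Corollary 7.18 (Uniform Glivenko-Cantelli classes). There is a universal
constant $L$ such that for any class $\mathcal{C}$ of measurable subsets of
$\mathbb{X}$ and $n\ge 1$

\[
\sup_\mu \mathbf{E}\Big[\sup_{C\in\mathcal{C}}|\mu_n(C)-\mu(C)|\Big] \le
L\sqrt{\frac{\mathrm{vc}(\mathcal{C})}{n}},
\]

where the supremum is taken over all probability measures $\mu$ on $\mathbb{X}$.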


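Proof. Using symmetrization and Theorem 7.16 we obtain

\begin{align*}
\mathbf{E}\Big[\sup_{C\in\mathcal{C}}|\mu_n(C)-\mu(C)|\Big]
&\le \frac{K'}{\sqrt{n}}\,\mathbf{E}\bigg[\int_0^1
\sqrt{\log N(\mathcal{C},\|\cdot\|_{L^2(\mu_n)},\varepsilon)}\,d\varepsilon\bigg] \\
&\le K'\sqrt{K}\,\sqrt{\frac{\mathrm{vc}(\mathcal{C})}{n}}
\int_0^1\sqrt{\log\frac{K}{\varepsilon}}\,d\varepsilon,
\end{align*}

where $K'$ is the universal constant that arises in Corollary 5.25 and we have
used that the diameter of $(\mathcal{C},\|\cdot\|_{L^2(\mu_n)})$ is at most one. □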

It  remains  to  take  care  of  unfinished  business:  we  must  still  prove  the  Sauer- 
Shelah  lemma.  The  remainder  of  the  section  will  be  devoted  to  this  task.  There 
are  in  fact  a  number  of  different  proofs  of  the  Sauer-Shelah  lemma,  each  of 
which  is  interesting  in  its  own  right.  We  will  develop  in  some  detail  a  proof 
that  is  loosely  reminiscent  of  the  lower  bound  construction  in  the  proof  of 
the  majorizing  measure  theorem.  In  the  case  of  classes  of  sets,  this  proof  is 
somewhat  pedantic;  the  same  basic  step  can  be  used  to  give  a  shorter  proof 
by  induction  on  the  dimension  (Problem  7.7).  However,  the  ideas  that  we 
will  develop  will  prove  to  be  particularly  useful  in  the  next  section  when  we 
attempt  to  extend  the  conclusion  of  Theorem  7.16  to  classes  of  functions. 

The conclusion of the Sauer-Shelah lemma is in fact an immediate consequence
of the following more precise combinatorial principle.

Theorem 7.19 (Pajor). For any class $\mathcal{C}$ of subsets of $\mathbb{X}$,
we have

\[
|\mathcal{C}| \le |\{I\subseteq\mathbb{X} : I \text{ is shattered by } \mathcal{C}\}|.
\]

Let  us  see  why  Lemma  7.12  follows. 

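Proof (Lemma 7.12). By the definition of the VC-dimension, every shattered
set $I$ must satisfy $|I|\le\mathrm{vc}(\mathcal{C})$. Thus Theorem 7.19 implies

\begin{align*}
|\mathcal{C}\cap\{x_1,\dots,x_n\}|
&\le |\{I\subseteq\{x_1,\dots,x_n\} : I \text{ is shattered by } \mathcal{C}\}| \\
&\le |\{I\subseteq\{x_1,\dots,x_n\} : |I|\le\mathrm{vc}(\mathcal{C})\}|
= \sum_{k=0}^{\mathrm{vc}(\mathcal{C})}\binom{n}{k}.
\end{align*}

The remaining bound in Lemma 7.12 is an elementary consequence of the
binomial theorem: for any $d\le n$ we can estimate

\[
\sum_{k=0}^{d}\binom{n}{k} \le
\Big(\frac{n}{d}\Big)^d\sum_{k=0}^{d}\binom{n}{k}\Big(\frac{d}{n}\Big)^k \le
\Big(\frac{n}{d}\Big)^d\Big(1+\frac{d}{n}\Big)^n \le
\Big(\frac{en}{d}\Big)^d.
\]

Thus the proof of Lemma 7.12 is hereby complete. □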
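The elementary estimate used in the last step of the proof can be sanity-checked numerically. The following sketch compares the binomial sum with the bound (en/d)^d over a small range of n and d; the function names and the range are arbitrary illustrative choices.

```python
import math

def binomial_sum(n, d):
    """Number of subsets of an n-point set with at most d elements."""
    return sum(math.comb(n, k) for k in range(d + 1))

def sauer_shelah_bound(n, d):
    """The closed-form upper bound (e n / d)^d, valid for 1 <= d <= n."""
    return (math.e * n / d) ** d

for n, d in [(10, 1), (10, 3), (100, 4), (1000, 4)]:
    print(n, d, binomial_sum(n, d), round(sauer_shelah_bound(n, d)))
```

For fixed d, both quantities grow polynomially in n with degree d, in contrast with the 2^n subsets of an n-point set.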

Remark 7.20. It is not difficult to see that Theorem 7.19 and Lemma 7.12 are
sharp. Indeed, consider the class
$\mathcal{C}=\{I\subseteq\{1,\dots,n\} : |I|\le d\}$. Then every subset of
cardinality $d$ is shattered, and clearly no set of cardinality greater than $d$
can be shattered. Thus $\mathrm{vc}(\mathcal{C})=d$, and in this example the
result of Theorem 7.19 and the first inequality in Lemma 7.12 hold with equality.
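The extremal example of Remark 7.20 is small enough to verify by enumeration. The sketch below lists all shattered sets of a finite class by brute force; the function name `shattered_sets` and the parameters n = 5, d = 2 are illustrative assumptions.

```python
from itertools import combinations

def shattered_sets(ground, sets):
    """All subsets of ground shattered by a nonempty finite class (brute force)."""
    sets = [frozenset(C) for C in sets]
    found = []
    for k in range(len(ground) + 1):
        for I in combinations(ground, k):
            I = frozenset(I)
            if len({I & C for C in sets}) == 2 ** len(I):
                found.append(I)
    return found

n, d = 5, 2
ground = list(range(1, n + 1))
small_sets = [frozenset(I) for k in range(d + 1) for I in combinations(ground, k)]
shattered = shattered_sets(ground, small_sets)
print(len(small_sets), len(shattered), max(len(I) for I in shattered))
```

In this example the class and the family of shattered sets coincide, which is why both counts agree.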

We  now  finally  turn  to  the  heart  of  the  matter,  which  is  to  prove  Theorem 
7.19.  The  essential  difficulty  that  we  face  is  to  devise  an  efficient  way  to 
organize  our  counting  of  the  number  of  shattered  sets.  This  requires  some 
form of bookkeeping. To this end, we will build a tree (cf. Definition 6.37)
of subsets of $\mathcal{C}$ (that is, each node of the tree represents a family
of sets in $\mathcal{C}$) that encodes information about what points are
shattered.

Definition 7.21 (Splitting tree). Let $\mathcal{C}$ be a class of subsets of
$\mathbb{X}$. A $\mathcal{C}$-tree $\mathcal{A}$ is called a splitting tree if
every node $A\in\mathcal{A}$ that is not a leaf satisfies:

1. $A$ has exactly two children $A_+$ and $A_-$;

2. There exists $x_A\in\mathbb{X}$ so that $x_A\in C$ for $C\in A_+$ and
$x_A\notin C$ for $C\in A_-$.

The motivation for this definition is that a set $I=\{x_1,\dots,x_n\}$ is
shattered if and only if there exists a splitting tree $\mathcal{A}$ with the
following properties:

1. $\mathcal{A}$ is a complete binary tree of depth $n$.

2. $\{x_A : A\in\mathcal{A}\} = \{x_1,\dots,x_n\}$.

Indeed, suppose such a tree exists. Then for any $J\subseteq I$, we can find a
set $C\in\mathcal{C}$ such that $C\cap I=J$ (thereby verifying that $I$ is
shattered) by using the tree as a lookup table: starting at the root
$\mathcal{C}$, traverse down the unique path in the tree such that at every node
$A$, we move to $A_+$ if $x_A\in J$ and to $A_-$ otherwise. We end up at a leaf
$A_J$ of the tree, and by construction any $C\in A_J$ satisfies $C\cap I=J$.
Conversely, if $I$ is shattered, then

\[
\mathcal{A} = \big\{\{C\in\mathcal{C} : C\cap\{x_1,\dots,x_i\}=J\} :
0\le i\le n,\ J\subseteq\{x_1,\dots,x_i\}\big\}
\]

evidently defines a splitting tree with the above two special properties.

In  view  of  the  above  discussion,  finding  shattered  sets  is  equivalent  to 
finding  complete  splitting  trees.  The  difficulty  is  that  complete  splitting  trees 
are  hard  to  find.  However,  it  is  very  easy  to  construct  a  splitting  tree  without 
the  above  special  properties  by  repeatedly  splitting  each  node  of  the  tree  into 
two  subsets  in  a  “greedy”  fashion  starting  at  the  root.  The  idea  behind  the 
proof  of  Theorem  7.19  is  to  show  that  any  large  splitting  tree  must  contain 
many subtrees that are complete. This is a simple example of the Ramsey
phenomenon that arises in many combinatorial problems, which states that
any “large” system must contain large “highly structured” subsystems.

Lemma 7.22. Let $\mathcal{C}$ be a class of subsets of $\mathbb{X}$. Then for
any splitting tree $\mathcal{A}$

\[
|\{\text{leaves of } \mathcal{A}\}| \le
|\{I\subseteq\mathbb{X} : I \text{ is shattered by } \mathcal{C}\}|.
\]

Proof. It is convenient to define for $A\subseteq\mathcal{C}$

\[
\mathcal{S}(A) := \{I\subseteq\mathbb{X} : I \text{ is shattered by } A\},
\]

where we note that $\varnothing\in\mathcal{S}(A)$ for any $A$. The key point of
the proof is that

\[
|\mathcal{S}(A)| \ge |\mathcal{S}(A_+)| + |\mathcal{S}(A_-)|
\]

holds for every node $A\in\mathcal{A}$ that is not a leaf. To see this, note
that if a set $I$ is shattered by a subfamily of $A$, then it is shattered by
$A$ as well by definition. Thus the only issue we have to address is that sets
$I$ that are shattered both by $A_+$ and $A_-$ are double-counted in the lower
bound. On the other hand, if $I$ is shattered by both $A_+$ and $A_-$, then it
is easily verified that both $I$ and $I\cup\{x_A\}$ are shattered by $A$. Thus
the claim is valid. To complete the proof, it remains to iterate the above
inequality starting from the root. This yields

\[
|\mathcal{S}(\mathcal{C})| \ge \sum_{A \text{ is a leaf}} |\mathcal{S}(A)| \ge
|\{\text{leaves of } \mathcal{A}\}|,
\]

where we have used that $|\mathcal{S}(A)|\ge 1$ (because
$\varnothing\in\mathcal{S}(A)$). □

To  complete  the  proof  of  Theorem  7.19,  it  remains  to  construct  a  splitting 
tree  with  |C|  leaves.  But  this  is  trivial:  the  most  naive  construction  works. 
Lemma 7.23. For any class of sets $\mathcal{C}$, there exists a splitting tree
$\mathcal{A}$ with

\[
|\{\text{leaves of } \mathcal{A}\}| = |\mathcal{C}|.
\]

Proof. It is trivial that for any subset $A\subseteq\mathcal{C}$ with
$|A|\ge 2$, we can choose $x_A\in\mathbb{X}$ such that
$A_+=\{C\in A : x_A\in C\}$ and $A_-=\{C\in A : x_A\notin C\}$ are
nonempty: indeed, it suffices to choose any $x_A\in C\,\triangle\,C'$ for
distinct elements $C,C'\in A$. Thus we can grow a splitting tree by starting at
the root $\mathcal{C}$ and repeatedly splitting the leaves of the tree into two
subsets until all the leaves are singletons. As we have not thrown out any
elements of $\mathcal{C}$, the leaves form a partition of $\mathcal{C}$, and as
each leaf is a singleton the conclusion follows. □

Combining  Lemmas  7.22  and  7.23  concludes  the  proof  of  Theorem  7.19. 
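The two lemmas can be illustrated on a small finite class. The sketch below grows a splitting tree greedily as in the proof of Lemma 7.23 and compares the number of leaves with a brute-force count of shattered sets, as in Theorem 7.19. The function names and the seven-element test family are hypothetical choices made for illustration.

```python
from itertools import combinations

def split_leaves(family):
    """Leaves of a greedily grown splitting tree rooted at a list of distinct sets."""
    if len(family) <= 1:
        return [family]
    x = min(family[0] ^ family[1])    # any point of the symmetric difference works
    plus = [C for C in family if x in C]
    minus = [C for C in family if x not in C]
    return split_leaves(plus) + split_leaves(minus)

def count_shattered(ground, family):
    """Number of subsets of ground shattered by family (brute force)."""
    total = 0
    for k in range(len(ground) + 1):
        for I in combinations(ground, k):
            I = frozenset(I)
            if len({I & C for C in family}) == 2 ** k:
                total += 1
    return total

ground = range(5)
family = [frozenset(s) for s in
          [(), (0,), (1, 2), (0, 1, 2), (3,), (2, 3, 4), (0, 4)]]
leaves = split_leaves(family)
print(len(leaves), len(family), count_shattered(ground, family))
```

Every leaf is a singleton, so the number of leaves equals the cardinality of the class, and the count of shattered sets is at least as large.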

Problems 

7.5 (Computing the VC-dimension). The aim of this problem is to compute
the VC-dimension of various classes of sets $\mathcal{C}$. We begin with a
simple observation that is useful in many geometric situations.

a. Let $\mathcal{C}$ be a class of convex subsets of $\mathbb{R}^d$. Show that
if $I\subset\mathbb{R}^d$ is shattered by $\mathcal{C}$, then every $x\in I$
must be an extreme point of the convex hull $\operatorname{conv} I$.

We  now  consider  several  interesting  examples  of  classes  of  convex  sets. 

b. Show that $\mathrm{vc}(\mathcal{C})=3$ for the class of discs in the plane

\[
\mathcal{C} = \{\{x\in\mathbb{R}^2 : \|x-z\|\le r\} : z\in\mathbb{R}^2,\ r\in\mathbb{R}_+\}.
\]

Hint: suppose that $\{x_1,x_2,x_3,x_4\}$ are the corners of a convex polygon,
listed in clockwise order. Suppose that $\|x_1-x_3\|\ge\|x_2-x_4\|$. Show that
no disc can contain $x_1,x_3$ without containing either $x_2$ or $x_4$.

c. Show that $\mathrm{vc}(\mathcal{C})=d+1$ for the class of $d$-dimensional
halfspaces

\[
\mathcal{C} = \{\{x\in\mathbb{R}^d : \langle z,x\rangle\ge a\} : z\in\mathbb{R}^d,\ a\in\mathbb{R}\}.
\]

Hint: consider $\{0,e_1,\dots,e_d\}$ (where $\{e_i\}$ denotes the unit basis in
$\mathbb{R}^d$). On the other hand, for any $\{x_1,\dots,x_{d+2}\}$, one can
find $b\in\mathbb{R}^{d+2}\setminus\{0\}$ such that
$b_1x_1+\cdots+b_{d+2}x_{d+2}=0$ and $b_1+\cdots+b_{d+2}=0$.

d. Show that $\mathrm{vc}(\mathcal{C})=7$ for the class of all triangles

\[
\mathcal{C} = \{\operatorname{conv}\{x_1,x_2,x_3\} : x_1,x_2,x_3\in\mathbb{R}^2\}.
\]

Hint: consider 7 points lying on a circle. On the other hand, let
$\{x_1,\dots,x_8\}$ be the corners of a convex polygon, listed in clockwise
order. Show that no triangle can contain $x_1,x_3,x_5,x_7$ but exclude
$x_2,x_4,x_6,x_8$, as every pair $x_i,x_{i+2}$ must be separated from $x_{i+1}$
by an edge of the triangle.

Let us note that triangles are naturally described by 6 parameters, while the
VC-dimension is 7. Similarly, halfspaces can be described by $d$ parameters
(as we may assume without loss of generality that $\|z\|=1$), while the
VC-dimension is $d+1$. Thus it is not always the case that the VC-dimension of a
parametrized family of sets equals the number of parameters.

The  following  construction  provides  a  useful  method  to  generate  classes  of 
sets  with  small  VC-dimension  that  can  have  complicated  structure. 

e. Let $\mathbb{X}$ be any set, and let $g_1,\dots,g_d:\mathbb{X}\to\mathbb{R}$
be arbitrary functions. Show that $\mathrm{vc}(\mathcal{C})\le d$ if we define
the class of upper level sets

\[
\mathcal{C} = \{\{x\in\mathbb{X} : b_1g_1(x)+\cdots+b_dg_d(x)\ge 0\} : b_1,\dots,b_d\in\mathbb{R}\}.
\]

Use  this  to  give  another  proof  of  the  VC-dimension  of  discs  in  part  b. 

Finally,  we  note  that  even  “nice”  sets  can  have  infinite  VC-dimension. 

f. Show that $\mathrm{vc}(\mathcal{C})=\infty$ for

\[
\mathcal{C} = \{C\subseteq\mathbb{R}^2 : C \text{ is compact and convex}\}.
\]

Hint:  consider  n  points  on  a  circle. 

7.6 (A sharper uniform covering bound). Theorem 7.16, as we have
stated it, implies that the metric dimension of
$(\mathcal{C},\|\cdot\|_{L^2(\mu)})$ is at most $K\,\mathrm{vc}(\mathcal{C})$
uniformly over all probability measures $\mu$. The constant $K$ that we obtained
is not sharp. The reason for this is that we have used a very general
probabilistic extraction principle in the form of Lemma 7.17. For classes of
sets, we can get away with a more elementary approach that leads to a better
constant.

The problem with Lemma 7.17 is that it insists that the $\varepsilon$-packing
$\{C_1,\dots,C_m\}$ in $L^2(\mu)$ remains an $\varepsilon/2$-packing in
$L^2(\mu_x)$. This strong separation will be needed when we extend to classes of
functions in the next section. Here, however, we are only interested in counting
$|\mathcal{C}\cap\{x_1,\dots,x_r\}|\ge m$. Therefore, to ensure that this is the
case, we only need to ensure that the sets $C_1,\dots,C_m$ remain distinct when
they are intersected with $\{x_1,\dots,x_r\}$.

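a. Let $X_1,\dots,X_r\sim\mu$ be i.i.d. Show that

\[
\mathbf{P}\big[C\cap\{X_1,\dots,X_r\} = C'\cap\{X_1,\dots,X_r\}\big] =
\big(1-\|\mathbf{1}_C-\mathbf{1}_{C'}\|_{L^2(\mu)}^2\big)^r.
\]

b. Conclude that if $C_1,\dots,C_m$ is an $\varepsilon$-packing in $L^2(\mu)$,
then

\[
\mathbf{P}\big[C_i\cap\{X_1,\dots,X_r\}\ne C_j\cap\{X_1,\dots,X_r\}
\text{ for all } i\ne j\big] \ge 1-m^2(1-\varepsilon^2)^r > 0
\]

for $r\ge 2\varepsilon^{-2}\log m$ (compare with
$r\gtrsim\varepsilon^{-4}\log m$ in Lemma 7.17!)

c. Deduce the following improved form of Theorem 7.16:

\[
\sup_\mu N(\mathcal{C},\|\cdot\|_{L^2(\mu)},\varepsilon) \le
\Big(\frac{K_\delta}{\varepsilon}\Big)^{(2+\delta)\,\mathrm{vc}(\mathcal{C})}
\quad\text{for all } \varepsilon<1,\ \delta>0,
\]

where $K_\delta$ is a universal constant that depends on $\delta$.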
d. The last bound is sharp in the following sense. Let
$\mathcal{C}=\{I\subseteq\mathbb{N} : |I|\le d\}$, so that
$\mathrm{vc}(\mathcal{C})=d$. Show that for a universal constant $K'_\delta$
depending on $\delta$

\[
\sup_\mu N(\mathcal{C},\|\cdot\|_{L^2(\mu)},\varepsilon) \ge
\Big(\frac{K'_\delta}{\varepsilon}\Big)^{(2-\delta)\,\mathrm{vc}(\mathcal{C})}
\quad\text{for all } \varepsilon<1,\ \delta>0.
\]

Hint: consider probability measures $\mu(\{n\})\propto n^{-(1+\alpha)}$.

Evidently $2\,\mathrm{vc}(\mathcal{C})$ is the optimal value of the exponent in
the behavior of the uniform covering numbers of a class of sets. In the above
bounds, however, $K_\delta\to\infty$ and $K'_\delta\to 0$ as $\delta\to 0$. A
delicate analysis due to Haussler shows that it is in fact possible to attain
the exponent $2\,\mathrm{vc}(\mathcal{C})$ with a finite constant.

7.7  (A  short  induction  proof  of  Pajor’s  theorem).  Our  proof  of  Theorem 
7.19  introduced  splitting  trees  as  a  bookkeeping  device.  The  insight  gained 
from  this  idea  will  pay  off  in  the  next  section.  In  the  case  of  sets,  however,  one 
can  rewrite  the  proof  in  a  much  more  efficient  manner  without  any  reference 
to  splitting  trees.  This  yields  perhaps  the  shortest  and  cleanest  approach. 

a. Suppose that the conclusion of Theorem 7.19 holds for any class $\mathcal{C}$
of subsets of $\mathbb{X}$ with $|\mathbb{X}|=m$. Show that the conclusion also
follows when $|\mathbb{X}|=m+1$.

Hint: let $|\mathbb{X}|=m+1$ and choose any $x\in\mathbb{X}$. Define
$\mathcal{C}_+=\{C\in\mathcal{C} : x\in C\}$ and
$\mathcal{C}_-=\{C\in\mathcal{C} : x\notin C\}$, and apply the basic argument of
Lemma 7.22.

b.  Conclude  the  proof  of  Theorem  7.19  by  induction  on  |X|. 

Let  us  emphasize  that  this  proof  is  essentially  identical  to  the  proof  we  have 
given.  Here  we  have  simply  merged  the  construction  of  the  splitting  tree  with 
the  proof  of  Lemma  7.22,  so  that  no  additional  bookkeeping  is  needed. 

7.8  (A  rearrangement  proof  of  Pajor’s  theorem).  The  goal  of  this  prob¬ 
lem  is  to  give  an  entirely  different  proof  of  Theorem  7.19  in  the  spirit  of  ex¬ 
tremal  combinatorics.  This  elegant  method  is  useful  in  many  other  problems. 

Let  us  begin  by  gaining  some  intuition.  A  class  C  of  subsets  of  a  set  X  is 
called  hereditary  if  C  €  C  implies  C'  €  6  for  all  C'  C  C. 

a.  Show  that  Theorem  7.19  holds  with  equality  for  hereditary  C. 

Evidently  hereditary  classes  are  extremal  with  respect  to  shattering.  The  idea 
we  will  now  pursue  is  that  an  arbitrary  class  C  can  be  transformed  into  a 
hereditary class without changing its cardinality or increasing the number of
shattered  sets.  This  will  be  done  by  a  form  of  rearrangement  (in  analogy  with 
the  proof  of  the  classical  isoperimetric  inequality  by  Steiner  symmetrization). 

Consider a class $\mathcal{C}$ of subsets of a finite set X. The basic step that we
consider is as follows. Given a point $x \in X$, define $S_x\mathcal{C} = \{S_xC : C \in \mathcal{C}\}$ such
that $S_xC = C \setminus \{x\}$ if $C \setminus \{x\} \notin \mathcal{C}$, and $S_xC = C$ otherwise. This operation is
called  shifting:  it  tries  to  “remove  the  holes”  in  the  class  C  that  prevent  it  from 
being  hereditary,  one  coordinate  at  a  time.  Let  us  investigate  its  consequences. 
b. Show that $|S_x\mathcal{C}| = |\mathcal{C}|$.

c. Show that if $I \subseteq X$ is shattered by $S_x\mathcal{C}$, then it is also shattered by $\mathcal{C}$.

d. Show that if $S_x\mathcal{C} = \mathcal{C}$ for all $x \in X$, then $\mathcal{C}$ is hereditary.

e. Now starting from any class $\mathcal{C}$, repeatedly apply the operation $S_x$ by cycling
through the points $x \in X$. Show that the transformed set $S_{x_q} \cdots S_{x_1}\mathcal{C}$
becomes hereditary after a finite number q of such operations.

f.  Show  that  the  conclusion  of  Theorem  7.19  follows  readily  (while  we  assumed 
here  that  X  is  finite,  argue  that  this  entails  no  loss  of  generality). 
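
The shifting operation is easy to experiment with on a computer. The sketch below is a minimal Python illustration, assuming the shift at x removes x from a set C exactly when C\{x} is not already in the class; the function names are hypothetical and the three-set class is just a toy example.

```python
from itertools import combinations


def shift(cls, x):
    """Apply the shifting operation at the point x to a class of frozensets."""
    shifted = set()
    for c in cls:
        smaller = c - {x}
        # Remove x from C only if the resulting set is not already in the class.
        shifted.add(smaller if smaller not in cls else c)
    return shifted


def is_hereditary(cls):
    """True if every subset of every member of the class is also a member."""
    return all(
        frozenset(s) in cls
        for c in cls
        for k in range(len(c) + 1)
        for s in combinations(sorted(c), k)
    )


def shift_until_hereditary(cls, ground):
    """Cycle through the points of `ground` until no shift changes the class."""
    cls = set(cls)
    changed = True
    while changed:
        changed = False
        for x in ground:
            new = shift(cls, x)
            if new != cls:
                cls, changed = new, True
    return cls


example = {frozenset({0, 1}), frozenset({1, 2}), frozenset({2})}
result = shift_until_hereditary(example, [0, 1, 2])
# Shifting preserves the cardinality and ends in a hereditary class.
assert len(result) == len(example)
assert is_hereditary(result)
```

The loop terminates because every shift that changes the class strictly decreases the total number of elements in its members, which is the observation behind part e.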

7.9 (Necessity of finite VC-dimension). We have seen that classes $\mathcal{C}$
with $\mathrm{vc}(\mathcal{C}) < \infty$ have many nice properties. In particular, such classes admit
distribution-free bounds. The aim of this problem is to show that the condition
$\mathrm{vc}(\mathcal{C}) < \infty$ is often also necessary to obtain distribution-free results.

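Let us begin by considering the uniform covering number. We have seen

$$\mathrm{vc}(\mathcal{C}) < \infty \quad\text{implies}\quad \sup_\mu N(\mathcal{C}, \|\cdot\|_{L^2(\mu)}, \varepsilon) < \infty$$

by Theorem 7.16. Let us show, conversely, that for $\varepsilon < 1/2$

$$\mathrm{vc}(\mathcal{C}) = \infty \quad\text{implies}\quad \sup_\mu N(\mathcal{C}, \|\cdot\|_{L^2(\mu)}, \varepsilon) = \infty.$$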

a.  Prove  the  following  basic  result. 

Lemma 7.24 (Gilbert-Varshamov). Let $\mathcal{C} = 2^X$ be the class of all subsets
of $X = \{1, \ldots, n\}$ and let $d(C, D) = |C \triangle D|$. Then $N(\mathcal{C}, d, n/4) \ge e^{n/8}$.

Hint: use a “volume argument” with the uniform measure on $\mathcal{C}$ in the role
of the volume, and use Azuma-Hoeffding to estimate the volume of d-balls.

b. Conclude that $\mathrm{vc}(\mathcal{C}) = \infty$ implies $\sup_\mu N(\mathcal{C}, \|\cdot\|_{L^2(\mu)}, \varepsilon) = \infty$ for $\varepsilon < 1/2$.
Hint: let $\mu$ be the uniform distribution on a shattered set $I \subseteq X$.
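
The volume argument behind Lemma 7.24 can be checked numerically. The Python sketch below assumes the balls are closed, so that a ball of radius n/4 around any fixed set contains exactly $\sum_{k \le n/4} \binom{n}{k}$ sets; any cover then needs at least $2^n$ divided by this number of balls. The function names are ours.

```python
import math


def ball_size(n, radius):
    """Number of subsets D of {1,...,n} with |C triangle D| <= radius, for any fixed C."""
    return sum(math.comb(n, k) for k in range(int(math.floor(radius)) + 1))


def volume_lower_bound(n):
    """Volume bound: a cover by balls of radius n/4 needs at least 2^n / |ball| balls."""
    return 2 ** n / ball_size(n, n / 4)


# Hoeffding's inequality bounds the relative volume of a ball by exp(-n/8),
# so the volume bound should dominate exp(n/8) for every n.
for n in (8, 16, 32, 64, 128):
    assert volume_lower_bound(n) >= math.exp(n / 8)
```

The comparison with $e^{n/8}$ is precisely the tail estimate $\mathbf{P}[\mathrm{Bin}(n, 1/2) \le n/4] \le e^{-n/8}$ that the hint asks for.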

Let  us  now  consider  the  uniform  law  of  large  numbers.  We  have  seen  that 


$$\mathrm{vc}(\mathcal{C}) < \infty \quad\text{implies}\quad \limsup_{n\to\infty} \sup_\mu \mathbf{E}\Big[\sup_{C\in\mathcal{C}} |\mu_n(C) - \mu(C)|\Big] = 0$$


by  Corollary  7.18.  Let  us  show,  conversely,  that 


$$\mathrm{vc}(\mathcal{C}) = \infty \quad\text{implies}\quad \liminf_{n\to\infty} \sup_\mu \mathbf{E}\Big[\sup_{C\in\mathcal{C}} |\mu_n(C) - \mu(C)|\Big] > 0.$$


Thus $\mathrm{vc}(\mathcal{C}) < \infty$ is sufficient and necessary to obtain a distribution-free rate
in  the  uniform  law  of  large  numbers  (the  uniform  Glivenko-Cantelli  property). 

c. Let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. symmetric Bernoulli. Show that $\mathrm{vc}(\mathcal{C}) = \infty$ implies

$$\sup_\mu \mathbf{E}\Big[\sup_{C\in\mathcal{C}} \frac{1}{n} \sum_{k=1}^n \varepsilon_k 1_C(X_k)\Big] \ge \frac{1}{2}.$$

Hint: let $\mu$ be the uniform distribution on a shattered set of cardinality
$N \gg n$, and show that $\{X_1, \ldots, X_n\}$ is shattered with high probability.

d. Conclude that the uniform Glivenko-Cantelli property fails if $\mathrm{vc}(\mathcal{C}) = \infty$.
Hint:  see  Problem  7.2. 
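
The hint to part c rests on an elementary computation: if the sample is drawn uniformly from a shattered set of N points, then the sample is itself shattered as soon as its n points are distinct. A minimal Python sketch of this probability (the function name is ours) reads as follows.

```python
def prob_all_distinct(n, N):
    """Probability that n i.i.d. uniform draws from N points are pairwise distinct."""
    p = 1.0
    for k in range(n):
        p *= 1 - k / N
    return p


# For fixed n the probability tends to one as N grows.
for N in (10**2, 10**4, 10**6):
    print(N, prob_all_distinct(10, N))
```

As the shattered set can be chosen arbitrarily large when the VC-dimension is infinite, the sample is shattered with probability as close to one as we like.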

Finally,  we  argue  that  the  distribution-free  rate  obtained  in  Corollary  7.18  is 
even  quantitatively  correct  up  to  universal  constants.  That  is,  let  us  show  that 


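$$K\sqrt{\mathrm{vc}(\mathcal{C})} \le \liminf_{n\to\infty} \sup_\mu \mathbf{E}\Big[\sup_{C\in\mathcal{C}} \sqrt{n}\,|\mu_n(C) - \mu(C)|\Big] \le \limsup_{n\to\infty} \sup_\mu \mathbf{E}\Big[\sup_{C\in\mathcal{C}} \sqrt{n}\,|\mu_n(C) - \mu(C)|\Big] \le L\sqrt{\mathrm{vc}(\mathcal{C})}.$$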


In  view  of  Corollary  7.18,  we  must  only  prove  the  lower  bound. 

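e. Denote by $\{Z_\mu(C)\}_{C\in\mathcal{C}}$ the centered Gaussian process whose covariance
function is given by $\mathrm{Cov}[Z_\mu(C), Z_\mu(C')] = \mathrm{Cov}_\mu[1_C, 1_{C'}]$. Show that

$$\liminf_{n\to\infty} \sup_\mu \mathbf{E}\Big[\sup_{C\in\mathcal{C}} \sqrt{n}\,|\mu_n(C) - \mu(C)|\Big] \ge \sup_\mu \mathbf{E}\Big[\sup_{C\in\mathcal{C}} |Z_\mu(C)|\Big].$$

f. Show that the right-hand side in the last inequality is $\gtrsim \sqrt{\mathrm{vc}(\mathcal{C})}$.

Hint: choose $\mu$ to be uniformly distributed on a shattered set $I$, and represent
$Z_\mu(C) = |I|^{-1/2} \sum_{x\in I} g_x \{1_C(x) - \mu(C)\}$ with $\{g_x\}_{x\in I}$ i.i.d. N(0,1).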
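
The square-root growth in part f can be observed in a quick simulation. The Python sketch below assumes that $\mu$ is uniform on a shattered set I with d points, so that every subset of I arises as a trace and $\mu(C) = |C \cap I|/d$; under this assumption the representation in the hint gives $Z_\mu(C) = d^{-1/2} \sum_{x \in C \cap I} (g_x - \bar{g})$, where $\bar{g}$ is the average of the $g_x$. The function name is hypothetical.

```python
import math
import random


def sup_gaussian(d, rng):
    """One sample of sup_C Z(C) over all subsets C of a d-point shattered set I."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    mean = sum(g) / d
    # Z(C) = d^{-1/2} * sum_{x in C} (g_x - mean); the best C keeps the positive terms.
    return sum(max(gx - mean, 0.0) for gx in g) / math.sqrt(d)


rng = random.Random(0)
trials = 2000
for d in (10, 40, 160):
    estimate = sum(sup_gaussian(d, rng) for _ in range(trials)) / trials
    # The ratio to sqrt(d) should stabilize as d grows.
    print(d, estimate / math.sqrt(d))
```

The ratio printed in the last column should settle near a constant, in agreement with a lower bound of order $\sqrt{d}$.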

7.10 (Glivenko-Cantelli theorem and convex sets). We have seen in
the previous problem that $\mathrm{vc}(\mathcal{C}) < \infty$ is necessary and sufficient in order for
the law of large numbers to hold uniformly over $\mathcal{C}$ with a distribution-free
rate. However, when $\mathrm{vc}(\mathcal{C}) = \infty$, it can still be the case that the law of large
numbers holds uniformly over $\mathcal{C}$ for any given distribution $\mu$. We characterized
such classes in Problem 7.2 in terms of a random entropy condition. It turns
out that in the case of sets, the entropy condition can be replaced by a random
combinatorial condition: $\mathcal{C}$ is a $\mu$-Glivenko-Cantelli class if and only if

$$\frac{\mathrm{vc}(\mathcal{C} \cap \{X_1, \ldots, X_n\})}{n} \xrightarrow{n\to\infty} 0 \quad\text{in probability},$$

where $X_1, X_2, \ldots$ is an i.i.d. sequence of variables with distribution $\mu$. Note
that this condition can clearly hold even when $\mathrm{vc}(\mathcal{C}) = \infty$.

a. Show that the above condition implies the $\mu$-Glivenko-Cantelli property.
Hint: use the random entropy condition of Problem 7.2.

b. Show that the $\mu$-Glivenko-Cantelli property implies the above condition.
Hint: start with the symmetrized formulation from Problem 7.2, and use
that $\mathbf{E}[\sup_{t\in T} \sum_{k\in I} \varepsilon_k t_k] \ge \mathbf{E}[\sup_{t\in T} \sum_{k\in J} \varepsilon_k t_k]$ when $J \subseteq I$.

The  advantage  of  the  combinatorial  formulation  is  that  shattered  sets  are 
very  rigid  structures  that  are  often  easy  to  detect.  Nonetheless,  in  the  present 
setting  we  must  understand  what  random  combinatorial  structures  can  arise 
in a sample $X_1, \ldots, X_n$ from a given distribution $\mu$, which may not be a trivial
matter.  Let  us  develop  in  detail  one  example  in  which  this  can  be  done. 

Let $\mathcal{C}$ be the class of all compact and convex subsets of $X = [0,1]^d$ (we can
easily extend the following arguments to the case $X = \mathbb{R}^d$ by a straightforward
truncation, but this provides no additional insight). It was shown in Problem
7.5 above that $\mathrm{vc}(\mathcal{C}) = \infty$. Nonetheless, we will show that $\mathcal{C}$ is $\mu$-Glivenko-Cantelli
whenever $\mu$ has a density with respect to Lebesgue measure.

c. Find an example of a measure $\mu$ such that $\mathcal{C}$ fails to be $\mu$-Glivenko-Cantelli.
Thus the assumption that $\mu$ has a density is not superfluous.

d. Show that a set $I$ is shattered by $\mathcal{C}$ if and only if none of the points $x \in I$ is
a convex combination of the others $I \setminus \{x\}$ (that is, $I$ is in convex position).

e. Show that if $\mu$ has a density with respect to Lebesgue measure, then the
boundary $\partial C$ of every convex set $C \in \mathcal{C}$ has zero measure $\mu(\partial C) = 0$.

Hint: if $0 \in \operatorname{int} C$, then $\partial C \subseteq (1+\varepsilon)C \setminus (1-\varepsilon)C$.

The  heuristic  idea  behind  the  proof  is  now  as  follows.  By  the  combinatorial 
formulation  developed  in  the  first  part  of  this  problem,  we  must  show  that 
among n random points $X_1, \ldots, X_n$, the maximal size of a subset that is in
convex position is sublinear in n. Suppose, to the contrary, that there is a
subset $I \subseteq \{X_1, \ldots, X_n\}$ with $|I| \ge \alpha n$ that is in convex position. Then the
boundary of the convex set $C = \operatorname{conv} I$ has empirical measure $\mu_n(\partial C) \ge \alpha$. If
we could argue $\mu_n(\partial C) \approx \mu(\partial C)$ for all $C \in \mathcal{C}$, we would have a contradiction.
At first sight, it seems like this got us nowhere: we must now prove that the
class $\partial\mathcal{C}$ of boundaries of convex sets is $\mu$-Glivenko-Cantelli! But the latter
problem  can  be  addressed  by  exploiting  the  geometry  of  convex  sets. 

f. Let $\mathcal{X}_m$ be the partition of $X = [0,1]^d$ into $m^d$ cubes of side length 1/m.
Define the discretized boundary $\partial_m C = \bigcup\{B \in \mathcal{X}_m : B \cap \partial C \ne \varnothing\}$. Prove

$$\limsup_{n\to\infty} \sup_{C\in\mathcal{C}} \mu_n(\partial C) \le \inf_{m\ge 1} \sup_{C\in\mathcal{C}} \mu(\partial_m C).$$

g. Clearly $\inf_{m\ge 1} \mu(\partial_m C) = \mu(\partial C) = 0$, but we need this conclusion to hold
uniformly over $C \in \mathcal{C}$. Show that if $\mu$ is the Lebesgue measure on X, then

$$\sup_{C\in\mathcal{C}} \mu(\partial_{3^m} C) \le (1 - 3^{-d})^m \quad\text{for all } m \ge 1.$$

Hint: for m = 1, the partition $\mathcal{X}_3$ consists of one cube in the center of X
surrounded by $3^d - 1$ cubes along the sides of X. Show that if all the cubes
along the sides contain a point in $\partial C$, then the middle cube cannot intersect
$\partial C$. Thus $\mu(\partial_3 C) \le (1 - 3^{-d})\,\mu(\partial_1 C)$. Now iterate this argument.

h. Deduce that if $\mu$ has a density with respect to Lebesgue measure, then

$$\inf_{m\ge 1} \sup_{C\in\mathcal{C}} \mu(\partial_m C) = 0.$$

i. Conclude that the combinatorial condition formulated at the beginning of
this problem holds for $\mathcal{C}$ whenever $\mu$ has a density with respect to Lebesgue
measure by carefully making precise the reasoning given above.

7.11 (Kolmogorov, Smirnov, and Donsker). Let $X_1, X_2, \ldots$ be i.i.d. real-valued
variables with distribution function $F(x) = \mu(]{-\infty}, x])$, and define the
empirical distribution function $F_n(x) = \mu_n(]{-\infty}, x])$. The classical Glivenko-Cantelli
theorem states that $\|F_n - F\|_\infty \to 0$. By Corollary 7.18, the convergence
even takes place at the central limit theorem rate $\|F_n - F\|_\infty \lesssim n^{-1/2}$.
We might therefore wonder whether one can go one step further and show
that $\sqrt{n}\,\|F_n - F\|_\infty$ converges weakly to some limiting distribution.

a. Let $G_n(x) := \sqrt{n}\,\{F_n(x) - F(x)\}$. Show that for any $x_1, \ldots, x_k \in \mathbb{R}$

$$(G_n(x_1), \ldots, G_n(x_k)) \Rightarrow (B(F(x_1)), \ldots, B(F(x_k))) \quad\text{in distribution}.$$

Here $\{B(t)\}_{t\in[0,1]}$ is the Brownian bridge defined by $B(t) = W(t) - tW(1)$,
where $\{W(t)\}_{t\in[0,1]}$ is standard Brownian motion.

In view of this computation, it is natural to conjecture that $\sqrt{n}\,\|F_n - F\|_\infty$
converges in distribution to $\|B\|_\infty$, the supremum of a Brownian bridge (note
that this limiting distribution does not depend on the law $\mu$!) This is indeed
the  case,  as  was  proved  by  Kolmogorov  and  Smirnov  in  the  1930s,  and  is  of 
significant  importance  in  classical  nonparametric  statistics. 
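
The statistic $\sqrt{n}\,\|F_n - F\|_\infty$ is simple to compute, as the supremum is attained at one of the sample points. The following Python sketch is a minimal illustration, assuming that F is continuous (so that ties have probability zero); the function name `ks_statistic` and the two test distributions are our own choices.

```python
import math
import random


def ks_statistic(sample, cdf):
    """Return sqrt(n) * sup_x |F_n(x) - F(x)| for a continuous distribution function F."""
    xs = sorted(sample)
    n = len(xs)
    sup = 0.0
    for i, x in enumerate(xs, start=1):
        u = cdf(x)
        # F_n jumps from (i-1)/n to i/n at the i-th order statistic.
        sup = max(sup, i / n - u, u - (i - 1) / n)
    return math.sqrt(n) * sup


rng = random.Random(0)
n = 1000
uniform = [rng.random() for _ in range(n)]
exponential = [rng.expovariate(1.0) for _ in range(n)]
stat_uniform = ks_statistic(uniform, lambda x: min(max(x, 0.0), 1.0))
stat_exponential = ks_statistic(exponential, lambda x: 1.0 - math.exp(-x))
```

Repeating the experiment many times for the two distributions produces histograms of the statistic that are hard to tell apart, which illustrates that the limiting law does not depend on $\mu$.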

It is obvious from the central limit theorem that if $I \subset \mathbb{R}$ is a finite set,
then $\max_{x\in I} \sqrt{n}\,|F_n(x) - F(x)|$ converges in distribution to $\max_{x\in I} |B(F(x))|$.
It is not at all clear, however, that this is still the case for $I = \mathbb{R}$. To prove
this, we must establish that $\sqrt{n}\,\|F_n - F\|_\infty$ can be approximated uniformly in
n by $\max_{x\in I} \sqrt{n}\,|F_n(x) - F(x)|$ for sufficiently large finite sets I. It is here that
the  empirical  process  machinery  that  we  have  developed  enters  the  picture. 

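b. Let $Q \subseteq \mathbb{R}^2$. Show that

$$\mathbf{E}\Big[\sup_{(x,x')\in Q} |G_n(x) - G_n(x')|\Big] \lesssim \mathbf{E}\Big[\omega\Big(\sup_{(x,x')\in Q} |F_n(x) - F_n(x')|\Big)\Big],$$

where $\omega(u) := \int_0^{\sqrt{u}} \sqrt{\log(1/\varepsilon)}\, d\varepsilon \lesssim \sqrt{u \log(1/u)}$.

c. Let $Q_\delta = \{(x, x') : |F(x) - F(x')| \le \delta\}$. Prove asymptotic equicontinuity

$$\lim_{\delta\downarrow 0} \limsup_{n\to\infty} \mathbf{E}\Big[\sup_{(x,x')\in Q_\delta} |G_n(x) - G_n(x')|\Big] = 0.$$

d. Show that there exist finite sets $I_k \subset \mathbb{R}$ such that

$$\lim_{k\to\infty} \limsup_{n\to\infty} \mathbf{E}\Big[\sqrt{n}\,\Big|\,\|F_n - F\|_\infty - \max_{x\in I_k} |F_n(x) - F(x)|\,\Big|\Big] = 0,$$

and conclude that

$$\sqrt{n}\,\|F_n - F\|_\infty \Rightarrow \|B\|_\infty \quad\text{in distribution}.$$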

From  the  asymptotic  equicontinuity  result  obtained  above,  we  can  in  fact 
derive  a  much  more  general  statement  of  the  idea  that  the  empirical  process 
$G_n$ converges weakly to the Brownian bridge $B \circ F$. This result, originally due
to  Donsker,  can  be  viewed  as  a  uniform  central  limit  theorem. 

e. View the empirical process $x \mapsto G_n(x)$ as a random path with values in
$\ell^\infty(\mathbb{R})$. Show that for any functional $H : \ell^\infty(\mathbb{R}) \to \mathbb{R}$ that is Lipschitz in
the sense $|H[G] - H[G']| \le L\|G - G'\|_\infty$ for all $G, G' \in \ell^\infty(\mathbb{R})$, we have

$$\mathbf{E}[H[G_n]] \to \mathbf{E}[H[B \circ F]] \quad\text{as } n \to \infty.$$

(Assume for simplicity that $H[G_n]$ and $H[B \circ F]$ are measurable, though
this  is  neither  obvious  nor  always  true;  measurability  issues  of  this  kind 
arise  often  in  the  development  of  uniform  central  limit  theorems.) 

While we have considered the example of empirical distribution functions for
the sake of illustration, uniform central limit theorems can be developed in
considerable generality. A class of functions $\mathcal{F}$ for which the empirical process
satisfies the central limit theorem in $\ell^\infty(\mathcal{F})$ is called a Donsker class. The
characterization of such classes, as well as closely related questions concerning
central limit theorems in Banach spaces, has historically motivated the
development of many of the tools that are used to control empirical processes.


7.3  Combinatorial  dimension  and  uniform  covering 

In  the  previous  section  we  developed,  in  the  special  case  of  classes  of  sets, 
a  combinatorial  method  to  control  uniformly  the  random  covering  numbers 
that appear in symmetrization bounds. In a sense, it is not surprising that
combinatorics enters the picture in this setting: as the empirical measure $\mu_n$ that
arises in the symmetrization process is supported on a finite set, it is natural
that our bounds for classes of sets will essentially reduce to the combinatorial
problem of counting induced subsets. Whether such ideas are still useful in the
general setting of classes of functions is far from clear at this point: even when
restricted to a finite set, a class of functions is still a continuous object (with a
potentially nontrivial geometric structure) and is not, a priori, combinatorial
in nature. Nonetheless, the theory of the previous section admits a very natural
generalization  to  classes  of  functions,  which  we  develop  presently. 

To gain some intuition, let us begin by reconsidering a class of sets $\mathcal{C}$ in
terms of the corresponding class of indicator functions $\mathcal{F} = \{1_C : C \in \mathcal{C}\}$. As
indicator functions only take the values zero and one, the restricted class

$$\mathcal{F}|_{x_1,\ldots,x_n} := \{(f(x_1), \ldots, f(x_n)) : f \in \mathcal{F}\} \subseteq \mathbb{R}^n$$

is a subset of the hypercube $\{0,1\}^n$. In particular, a set $\{x_1, \ldots, x_n\}$ is said
to be shattered by $\mathcal{C}$ precisely when $\mathcal{F}|_{x_1,\ldots,x_n} = \{0,1\}^n$ is the full hypercube.

Thus we can interpret $\mathrm{vc}(\mathcal{C})$ geometrically as the largest dimension of a hypercube
that is contained in a coordinate projection of $\mathcal{C}$. This idea is illustrated
in the following figure for different classes of subsets of $\{x_1, x_2\}$:


[Figure: three panels plotting the restricted class in the plane with axes $1_C(x_1)$ and $1_C(x_2)$, one each for a class with vc(C) = 0, vc(C) = 1, and vc(C) = 2.]


In contrast to the special case of indicator functions, for a general class of
functions $\mathcal{F}$ the restricted class $\mathcal{F}|_{x_1,\ldots,x_n}$ can be an arbitrary subset of $\mathbb{R}^n$.
In analogy with the combinatorial theory of the previous section, we might
try to define the VC-dimension of $\mathcal{F}$ as the largest dimension of a cube that
is contained in a coordinate projection of $\mathcal{F}$. However, unlike in the case of
indicator functions, there is some ambiguity in this definition in the general
setting: the notion of dimension we obtain will depend on the size of the cubes
that we consider and not just on their dimension. For example, it is perfectly
possible that $\mathcal{F}$ contains only low-dimensional cubes of the form $\{0,1\}^n$, but
contains high-dimensional cubes of the form $\{0,\varepsilon\}^n$ for $\varepsilon \ll 1$. To emphasize
this point, consider a simple example that is illustrated in the following figure:


[Figure: the coordinate projection $\mathcal{F}|_{x_1,x_2}$ drawn in the plane with axes $f(x_1)$ and $f(x_2)$.]

The projection $\mathcal{F}|_{x_1,x_2}$ contains a cube of size at most $\sim\varepsilon$, but each of the
projections $\mathcal{F}|_{x_1}$ and $\mathcal{F}|_{x_2}$ contains a cube of size $\sim 1$. The set evidently contains
no cubes of size $\gg 1$. Thus the dimension of the set depends on the scale at
which we are viewing it: it is zero-dimensional at very large scales (it looks
like a point), it is one-dimensional at scale $\sim 1$ (it looks like the letter L),
and it is two-dimensional at scale $\sim\varepsilon$ (where we see the “fatness” of the set).
If the class $\mathcal{F}$ is defined on other points $x_3, x_4, \ldots$ as well, then the set can
be higher-dimensional still when viewed at smaller scales. The dependence
of  the  dimension  on  scale  is  not  a  drawback  of  this  approach,  but  a  genuine 
phenomenon:  in  extending  the  theory  of  the  previous  section  to  the  general 
setting,  we  must  introduce  a  scale-sensitive  notion  of  dimension  in  order  to 
capture  the  structure  of  the  set  from  the  point  of  view  of  covering  numbers. 
In  the  remainder  of  this  section  we  will  make  these  ideas  precise. 

Let us begin by making precise what we mean by the statement that a
coordinate projection of $\mathcal{F}$ contains a cube. The requirement that $\mathcal{F}|_{x_1,\ldots,x_n}$
actually contains a copy of some hypercube $\{0,\varepsilon\}^n$ is too stringent: for example,
if $\mathcal{F}|_{x_1,\ldots,x_n}$ were itself a tiny perturbation of a hypercube (e.g., perturb
each corner of the hypercube randomly), then it would not contain any hypercube
but the dimension should not be much affected. Instead, we introduce a
slightly more flexible generalization of the notion of a shattered set.

Definition 7.25 ($\varepsilon$-shattering). Let $I \subseteq X$ and $h \in \mathbb{R}^I$. The pair $(I, h)$ is
said to be $\varepsilon$-shattered by $\mathcal{F}$ if for every $J \subseteq I$, there exists $f \in \mathcal{F}$ such that

$$f(x) \le h(x) \text{ for } x \in J, \qquad f(x) \ge h(x) + \varepsilon \text{ for } x \in I \setminus J.$$

The set $I \subseteq X$ is said to be $\varepsilon$-shattered if $(I, h)$ is $\varepsilon$-shattered for some $h \in \mathbb{R}^I$.

If the inequalities $f(x) \le h(x)$ and $f(x) \ge h(x) + \varepsilon$ in the definition of an
$\varepsilon$-shattered set were replaced by equalities, then the definition would reduce to
the statement that $\mathcal{F}|_I \supseteq h + \{0,\varepsilon\}^I$, that is, that the coordinate projection
of $\mathcal{F}$ on $I$ contains a (translate of the) hypercube $\{0,\varepsilon\}^I$. When the class $\mathcal{F}$ is
convex these two definitions are even equivalent, see Problem 7.13. However,
in the general setting, the notion of $\varepsilon$-shattering as defined above provides a
suitable implementation of the idea that $\mathcal{F}$ contains a combinatorial structure
that is “larger” than a hypercube $\{0,\varepsilon\}^I$ in the appropriate sense.

Having defined a notion of shattering for function classes, we can analogously
extend the definition of VC-dimension for a given scale $\varepsilon > 0$.

Definition 7.26 (Combinatorial dimension). The combinatorial dimension
of $\mathcal{F}$ at scale $\varepsilon$ is defined as $\mathrm{vc}(\mathcal{F}, \varepsilon) := \sup\{|I| : I \text{ is } \varepsilon\text{-shattered by } \mathcal{F}\}$.

Remark 7.27. $\mathrm{vc}(\mathcal{F}, \varepsilon)$ is known under various different names, including scale-sensitive
dimension or the somewhat lipectomous fat-shattering dimension.
Note that, by its definition, $\mathrm{vc}(\mathcal{F}, \varepsilon)$ is increasing as $\varepsilon \downarrow 0$.
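
For a finite class of functions on a finite set, the combinatorial dimension can be computed by exhaustive search. The Python sketch below is only an illustration under our own conventions: a function is a dictionary from points to values, shattering uses the non-strict inequalities $f \le h$ and $f \ge h + \varepsilon$, and the witness h is searched among the values attained by the class (which loses nothing, as lowering h(x) to the largest attained value below it preserves shattering). The toy class `L_SHAPE` mimics the L-shaped set discussed above.

```python
from itertools import combinations, product


def pair_is_shattered(funcs, points, h, eps):
    """Check whether (points, h) is eps-shattered by the finite class `funcs`."""
    patterns = set()
    for f in funcs:
        # Keep f only if it avoids the gap between h and h + eps at every point.
        if all(f[x] <= h[x] or f[x] >= h[x] + eps for x in points):
            patterns.add(tuple(f[x] <= h[x] for x in points))
    return len(patterns) == 2 ** len(points)


def set_is_shattered(funcs, points, eps):
    """Search for a witness h among the values attained by the class."""
    candidates = [sorted({f[x] for f in funcs}) for x in points]
    return any(
        pair_is_shattered(funcs, points, dict(zip(points, values)), eps)
        for values in product(*candidates)
    )


def combinatorial_dimension(funcs, ground, eps):
    """Largest size of an eps-shattered subset of `ground` (brute force)."""
    best = 0
    for k in range(1, len(ground) + 1):
        if any(set_is_shattered(funcs, s, eps) for s in combinations(ground, k)):
            best = k
    return best


POINTS = ("x1", "x2")
# A thick letter L: long arms of length 1 and thickness 0.1.
L_SHAPE = [
    {"x1": a, "x2": b}
    for a, b in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.1, 0.1), (1.0, 0.1), (0.1, 1.0)]
]
for eps in (2.0, 1.0, 0.1):
    print(eps, combinatorial_dimension(L_SHAPE, POINTS, eps))
```

On this example the dimension grows as the scale decreases, in line with the heuristic picture of a set that looks like a point, a letter L, or a fat two-dimensional set depending on the scale.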


To  illustrate  this  notion,  let  us  consider  two  useful  examples. 
Example 7.28 (Vector spaces). Let X be any set and let $f_1, \ldots, f_d : X \to \mathbb{R}$ be
linearly independent functions. Consider the linear class of functions

$$\mathcal{F} = \{a_1 f_1 + \cdots + a_d f_d : a_1, \ldots, a_d \in \mathbb{R}\}.$$

We claim that the combinatorial dimension of $\mathcal{F}$ is given by

$$\mathrm{vc}(\mathcal{F}, \varepsilon) = d \quad\text{for all } \varepsilon > 0.$$

Thus in this case, the dimension of $\mathcal{F}$ does not depend on the scale $\varepsilon$.

Let us first show that $\mathrm{vc}(\mathcal{F}, \varepsilon) \ge d$. By linear independence, we can choose
$x_1, \ldots, x_d \in X$ so that the matrix M with $M_{ij} = f_j(x_i)$ is nonsingular. Then
for any $b \in \mathbb{R}^d$, we can find $f \in \mathcal{F}$ such that $f(x_i) = b_i$ for all i: just choose

$$f = \sum_{i=1}^d a_i f_i, \qquad a = M^{-1} b.$$

It follows immediately that $\{x_1, \ldots, x_d\}$ is $\varepsilon$-shattered.

It remains to show that $\mathrm{vc}(\mathcal{F}, \varepsilon) \le d$. Suppose there exists an $\varepsilon$-shattered
set $I = \{x_1, \ldots, x_{d+1}\}$. The matrix M defined above is now a $(d+1) \times d$
matrix, so there exists a vector $z \in \mathbb{R}^{d+1} \setminus \{0\}$ such that $z^* M = 0$. Thus

$$\sum_{i=1}^{d+1} z_i f(x_i) = 0 \quad\text{for all } f \in \mathcal{F}.$$

As I is $\varepsilon$-shattered, however, we can choose $f_+, f_- \in \mathcal{F}$ so that $f_\pm(x_i) \le h_i$ for
$\operatorname{sign} z_i = \mp 1$ and $f_\pm(x_i) \ge h_i + \varepsilon$ otherwise. Then $f = f_+ - f_- \in \mathcal{F}$ satisfies

$$\sum_{i=1}^{d+1} z_i f(x_i) \ge \varepsilon \sum_{i=1}^{d+1} |z_i| > 0,$$

which entails a contradiction. Thus $\{x_1, \ldots, x_{d+1}\}$ cannot be $\varepsilon$-shattered.

Example 7.29 (Functions of bounded variation). Recall that the total variation
of a function $f : \mathbb{R} \to \mathbb{R}$ is defined in the following manner:

$$\|f\|_{\mathrm{var}} := \sup_n \sup_{x_1 < \cdots < x_n} \sum_{k=1}^{n-1} |f(x_{k+1}) - f(x_k)|.$$

Let us consider the class of functions of bounded variation

$$\mathcal{F} = \{f : \mathbb{R} \to \mathbb{R} : \|f\|_{\mathrm{var}} \le V\}.$$

There are many functions of bounded variation: examples include bounded
increasing functions and Lipschitz functions with compact support.

We are going to show that the combinatorial dimension of $\mathcal{F}$ satisfies

$$\mathrm{vc}(\mathcal{F}, \varepsilon) = 1 + \Big\lfloor \frac{V}{\varepsilon} \Big\rfloor \quad\text{for all } \varepsilon > 0.$$

Thus, unlike in the previous example, the class $\mathcal{F}$ is genuinely infinite-dimensional:
the combinatorial dimension diverges as $\varepsilon \downarrow 0$. Nonetheless, at
every fixed scale the class is finite-dimensional, which is precisely what will
be needed to estimate the uniform covering numbers below.

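Consider $I = \{x_1, \ldots, x_n\} \subset \mathbb{R}$ with $x_1 < \cdots < x_n$. Suppose that I is
$\varepsilon$-shattered by $\mathcal{F}$. Then we can find $h \in \mathbb{R}^I$ and $f_+, f_- \in \mathcal{F}$ such that

$$f_+(x_i) \le h(x_i) \text{ for odd } i, \qquad f_+(x_i) \ge h(x_i) + \varepsilon \text{ for even } i,$$
$$f_-(x_i) \le h(x_i) \text{ for even } i, \qquad f_-(x_i) \ge h(x_i) + \varepsilon \text{ for odd } i.$$

In particular, $f = \frac{1}{2}\{f_+ - f_-\} \in \mathcal{F}$ satisfies

$$f(x_i) \le -\frac{\varepsilon}{2} \text{ for odd } i, \qquad f(x_i) \ge \frac{\varepsilon}{2} \text{ for even } i.$$

This construction is illustrated in the following figure.

[Figure: the function f evaluated at the points $x_1 < x_2 < x_3 < x_4 < x_5$.]

By construction, we can now estimate

$$(n-1)\varepsilon \le \sum_{k=1}^{n-1} |f(x_{k+1}) - f(x_k)| \le \|f\|_{\mathrm{var}} \le \frac{\|f_+\|_{\mathrm{var}} + \|f_-\|_{\mathrm{var}}}{2} \le V,$$

and thus the cardinality of our shattered set must satisfy $n \le 1 + V/\varepsilon$. As the
combinatorial dimension is integer, this evidently implies $\mathrm{vc}(\mathcal{F}, \varepsilon) \le 1 + \lfloor V/\varepsilon \rfloor$.

Now let $x_1 < \cdots < x_n$ with $n = 1 + \lfloor V/\varepsilon \rfloor$ be arbitrary. Define

$$f_J(x) = \begin{cases} \varepsilon 1_{x_1 \notin J} & \text{for } x \in ]{-\infty}, x_2[, \\ \varepsilon 1_{x_i \notin J} & \text{for } x \in [x_i, x_{i+1}[,\ 1 < i < n, \\ \varepsilon 1_{x_n \notin J} & \text{for } x \in [x_n, \infty[ \end{cases}$$

for every $J \subseteq \{x_1, \ldots, x_n\}$. Then $\|f_J\|_{\mathrm{var}} \le (n-1)\varepsilon \le V$, so $f_J \in \mathcal{F}$. Moreover,
by construction, $f_J(x_i) = 0$ if $x_i \in J$ and $f_J(x_i) = \varepsilon$ if $x_i \notin J$. Thus any set
of cardinality n is $\varepsilon$-shattered by $\mathcal{F}$, so we have proved $\mathrm{vc}(\mathcal{F}, \varepsilon) = 1 + \lfloor V/\varepsilon \rfloor$.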
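
The step-function construction in the lower bound can be verified mechanically in a small case. The Python sketch below assumes the witness h = 0 and represents each $f_J$ by its levels at the points $x_1 < \cdots < x_n$ (the function is constant in between, so its total variation is the sum of the jumps); exact rational arithmetic avoids rounding issues. The values of V and $\varepsilon$ are arbitrary illustrative choices.

```python
from fractions import Fraction
from itertools import combinations
from math import floor


def step_values(points, J, eps):
    """Levels of f_J at the points: 0 on J and eps off J."""
    return [Fraction(0) if x in J else eps for x in points]


def total_variation(values):
    """Total variation of a step function with the given consecutive levels."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


V, eps = Fraction(1), Fraction(3, 10)
n = 1 + floor(V / eps)          # claimed combinatorial dimension at scale eps
points = list(range(n))         # any n increasing points will do

subsets = [set(J) for k in range(n + 1) for J in combinations(points, k)]
patterns = {tuple(step_values(points, J, eps)) for J in subsets}
max_variation = max(total_variation(step_values(points, J, eps)) for J in subsets)

# Every pattern in {0, eps}^n is realized, and no f_J exceeds the variation budget.
assert len(patterns) == 2 ** n
assert max_variation <= (n - 1) * eps <= V
```

The worst case is the alternating pattern, whose variation is exactly $(n-1)\varepsilon$; this is why one more point cannot be accommodated within the budget V.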

In view of the above discussion and examples, the combinatorial dimension
$\mathrm{vc}(\mathcal{F}, \varepsilon)$ is evidently a natural analogue in the general setting of the VC-dimension
of a class of sets. However, the real power of this notion lies not in
its definition, but in the fact that it can be used to bound uniform covering
numbers in direct analogy to the theory developed in the previous section.
This is made precise by the following generalization of Theorem 7.16.

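Theorem 7.30 (Mendelson-Vershynin). Let $\mathcal{F}$ be a class of functions on
X that is uniformly bounded $\sup_{f\in\mathcal{F}} \|f\|_\infty \le 1$. Then we have

$$\sup_\mu N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \le \Big(\frac{K}{\varepsilon}\Big)^{K\,\mathrm{vc}(\mathcal{F}, \varepsilon/K)} \quad\text{for all } \varepsilon > 0,$$

where K is a universal constant.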

Note that Theorem 7.30 is indeed a generalization of Theorem 7.16: if
$\mathcal{F} = \{1_C : C \in \mathcal{C}\}$, then it is easily seen that $\mathrm{vc}(\mathcal{F}, \varepsilon) = \mathrm{vc}(\mathcal{C})$ for all $\varepsilon \le 1$,
and thus we recover Theorem 7.16. On the other hand, unlike in the case of
sets, Theorem 7.30 can bound the covering numbers of classes of functions
with infinite metric dimension: for example, if we consider the class

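$$\mathcal{F} = \{f : \mathbb{R} \to [-1,1] : \|f\|_{\mathrm{var}} \le 1\},$$

then Theorem 7.30 yields

$$\sup_\mu N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \le e^{\frac{K}{\varepsilon}\log\frac{K}{\varepsilon}},$$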

so this bound on the covering numbers even grows superexponentially in $1/\varepsilon$
(we will see in the next section that the optimal bound in this example is only
exponential in $1/\varepsilon$; however, the above bound suffices in most applications).

We  now  turn  to  the  proof  of  Theorem  7.30.  The  main  steps  in  the  proof  are 
precisely  the  same  as  in  Theorem  7.16.  We  will  first  use  probabilistic  extraction 
to  reduce  the  original  continuous  problem  to  a  combinatorial  problem;  we 
already  phrased  the  extraction  Lemma  7.17  in  terms  of  functions,  so  that 
no  additional  work  is  needed.  Then,  we  will  use  a  combinatorial  principle  to 
resolve  the  finite  problem.  The  main  challenge  in  the  general  setting  is  to  prove 
a counterpart of Pajor’s Theorem 7.19 that counts $\varepsilon$-shattered sets $(I, h)$. Let
us  begin  by  giving  a  precise  statement  of  the  requisite  result. 

Definition 7.31 ($\varepsilon$-cube). A pair $(I, h)$ is called an $\varepsilon$-cube in $\mathcal{F}$ if $I \subseteq X$,
$h \in (\varepsilon\mathbb{Z})^I$, and the pair $(I, h)$ is $\varepsilon$-shattered by $\mathcal{F}$.

Thus an $\varepsilon$-cube is simply an $\varepsilon$-shattered pair $(I, h)$ such that the values of
h(x) are integer multiples of $\varepsilon$. The reason for the latter restriction is to ensure
that the problem of counting $\varepsilon$-cubes is a combinatorial one: if $|X| < \infty$ and
$\|f\|_\infty \le 1$ for all $f \in \mathcal{F}$, then there are only a finite number of possibilities for
I and h. The following result is a form of Pajor’s Theorem 7.19 for $\varepsilon$-cubes.

Theorem 7.32. Let $\mathcal{F}$ be a class of functions and let $\mu$ be a probability on X.
Then for any $S \subseteq \mathcal{F}$ that is a $c\varepsilon$-packing of $(\mathcal{F}, \|\cdot\|_{L^2(\mu)})$, we have

$$|S|^{1/2} \le |\{(I, h) : (I, h) \text{ is an } \varepsilon\text{-cube}\}|.$$

Here c is a universal constant.

Note that even in the special case of indicator functions, Theorem 7.32
yields a somewhat weaker result than Theorem 7.19. While these two results
and their proofs are very much in the same spirit, there is a genuinely new
difficulty that arises in the setting of functions that must be overcome by
Theorem 7.32 and that accounts for the difference between the two results.
To understand the problem, note that for indicator functions $1_C(x) \ne 1_D(x)$
necessarily implies $1_C(x) \le 0$ and $1_D(x) \ge 1$ or vice versa, so a shattered
set is automatically 1-shattered. On the other hand, for arbitrary functions
$f(x) \ne g(x)$ does not imply $f(x) \le h$ and $g(x) \ge h + \varepsilon$ or vice versa, as is
needed in the definition of $\varepsilon$-shattering. In the process of counting $\varepsilon$-shattered
sets we will necessarily have to throw out some of the functions in S that
happen to take values in the forbidden regions $]h, h + \varepsilon[$, and the key difficulty
in the proof is to ensure that we do not discard too many of these functions.
The assumption that S is a $c\varepsilon$-packing of $(\mathcal{F}, \|\cdot\|_{L^2(\mu)})$ is needed to ensure
that we can find coordinates on which there are many functions in S that do
not take values in $]h, h + \varepsilon[$. On the other hand, after throwing out the “bad”
functions we will only be able to ensure that we have $|S|^{1/2}$ functions left over,
which accounts for the difference between the conclusions of Theorems 7.32
and 7.19. These ideas will be made precise in the proof.

Before  proving  Theorem  7.32,  however,  let  us  first  complete  the  proof  of 
Theorem  7.30  as  we  now  have  all  the  necessary  ingredients  to  do  so.  We  begin 
by  formulating  an  analogue  of  the  Sauer-Shelah  lemma  in  the  present  setting. 

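Corollary 7.33. Let $\mathcal{F}$ be a class of functions on a finite set X with $\|f\|_\infty \le 1$
for all $f \in \mathcal{F}$. Then for any probability $\mu$ and $c\varepsilon$-packing S of $(\mathcal{F}, \|\cdot\|_{L^2(\mu)})$

$$|S|^{1/2} \le \sum_{k=0}^{\mathrm{vc}(\mathcal{F},\varepsilon)} \binom{|X|}{k} \Big(\frac{2}{\varepsilon}\Big)^k \le \Big(\frac{2e|X|}{\varepsilon\,\mathrm{vc}(\mathcal{F},\varepsilon)}\Big)^{\mathrm{vc}(\mathcal{F},\varepsilon)}.$$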

Proof. If $(I, h)$ is an $\varepsilon$-cube, then h(x) is an integer multiple of $\varepsilon$ and we must
have $-1 \le h(x) \le 1$ as $\|f\|_\infty \le 1$ for all $f \in \mathcal{F}$. Thus, for a given $I \subseteq X$, there
can be at most $(2/\varepsilon)^{|I|}$ $\varepsilon$-cubes $(I, h)$. There are consequently at most $\binom{|X|}{k}(2/\varepsilon)^k$
$\varepsilon$-cubes $(I, h)$ with $|I| = k$. By definition, however, any $\varepsilon$-cube $(I, h)$ must
have $|I| \le \mathrm{vc}(\mathcal{F}, \varepsilon)$. Thus the first inequality follows from Theorem 7.32, while
the second inequality follows as in the proof of Lemma 7.12. □
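
The counting step in this proof can be sanity-checked numerically. The Python sketch below rests on an assumption of ours about the form of the two bounds: that the number of candidate $\varepsilon$-cubes is at most $\sum_{k \le v} \binom{n}{k} (2/\varepsilon)^k$ with n = |X| and v the combinatorial dimension, and that this sum is controlled by the Sauer-Shelah type expression $(2en/\varepsilon v)^v$ whenever $1 \le v \le n$ and $\varepsilon \le 2$. The function names are hypothetical.

```python
import math


def cube_count_bound(n, v, eps):
    """Sum over k <= v of C(n, k) * (2/eps)^k: an upper count of candidate eps-cubes."""
    return sum(math.comb(n, k) * (2 / eps) ** k for k in range(v + 1))


def sauer_shelah_type_bound(n, v, eps):
    """Closed-form bound (2 e n / (eps v))^v, assumed valid for 1 <= v <= n, eps <= 2."""
    return (2 * math.e * n / (eps * v)) ** v


# Compare the two expressions over a small grid of parameters.
for n in (5, 20, 50):
    for v in range(1, n + 1):
        for eps in (0.1, 0.5, 1.0, 2.0):
            assert cube_count_bound(n, v, eps) <= sauer_shelah_type_bound(n, v, eps)
```

The comparison works because $(2/\varepsilon)^k \le (2/\varepsilon)^v$ for $k \le v$ when $\varepsilon \le 2$, after which the usual estimate $\sum_{k \le v} \binom{n}{k} \le (en/v)^v$ applies.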

We  can  now  complete  the  proof  of  Theorem  7.30. 

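Proof (Theorem 7.30). Let $\mu$ be any probability on X, and let $S = \{f_1, \ldots, f_m\}$
be a maximal $\varepsilon$-packing of $(\mathcal{F}, \|\cdot\|_{L^2(\mu)})$. By Lemma 7.17, there exist $r \le c\varepsilon^{-4} \log m$
points $x_1, \ldots, x_r$ such that S is an $\varepsilon/2$-packing of $(\mathcal{F}, \|\cdot\|_{L^2(\mu_x)})$ with $\mu_x = \frac{1}{r} \sum_{k=1}^r \delta_{x_k}$.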

Using  Corollary  7.33  and  arguing  as  in  the  proof  of  Theorem  7.16  yields 
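
$$m^{1/2} \le \Big(\frac{\log m}{\mathrm{vc}(\mathcal{F}, \varepsilon/2c)} \frac{4ec}{\varepsilon^5}\Big)^{\mathrm{vc}(\mathcal{F}, \varepsilon/2c)} \le m^{1/4} \Big(\frac{4(4ec)^{1/5}}{\varepsilon}\Big)^{5\,\mathrm{vc}(\mathcal{F}, \varepsilon/2c)}.$$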


As $N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \le m$, the proof is readily completed.


□ 


The  remainder  of  this  section  is  devoted  to  the  proof  of  Theorem  7.32. 
Let  us  first  recall  how  we  proved  the  analogous  result  for  classes  of  sets:  first, 
we  introduced  a  structure,  called  a  splitting  tree,  to  help  us  count  shattered 
sets.  A  shattered  set  corresponds  to  a  complete  splitting  tree,  but  these  are 
hard  to  find.  Instead,  we  proved  a  sort  of  Ramsey  principle:  any  splitting  tree 
contains  at  least  as  many  complete  subtrees  as  the  number  of  leaves  in  the 
tree.  For  a  class  of  sets  C,  it  was  trivial  to  construct  a  splitting  tree  with  |C| 
leaves  in  a  greedy  fashion,  and  thus  the  result  followed. 

We  will  follow  exactly  the  same  approach  in  the  proof  of  Theorem  7.32. 
Let  us  begin  by  defining  the  analogue  of  a  splitting  tree  in  the  present  setting. 


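Definition 7.34 ($\varepsilon$-splitting tree). Let $\mathcal{F}$ be a class of functions on $X$. An $\mathcal{F}$-tree $\mathbf{A}$ is called an $\varepsilon$-splitting tree if every $\mathcal{A} \in \mathbf{A}$ that is not a leaf satisfies:

1. $\mathcal{A}$ has exactly two children $\mathcal{A}_+$ and $\mathcal{A}_-$;

2. There exist $x_{\mathcal{A}} \in X$ and $h_{\mathcal{A}} \in \varepsilon\mathbb{Z}$ such that

$$f(x_{\mathcal{A}}) \le h_{\mathcal{A}} \ \text{ for } f \in \mathcal{A}_-, \qquad f(x_{\mathcal{A}}) \ge h_{\mathcal{A}} + \varepsilon \ \text{ for } f \in \mathcal{A}_+.$$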

In exact analogy to the previous section (cf. Definition 7.21 and the discussion thereafter), an $\varepsilon$-cube corresponds to a complete $\varepsilon$-splitting tree, while any $\varepsilon$-splitting tree contains at least as many complete subtrees as leaves.

Lemma 7.35. Let $\mathcal{F}$ be a class of functions on $X$. For any $\varepsilon$-splitting tree $\mathbf{A}$

$$|\{\text{leaves of } \mathbf{A}\}| \le |\{(I,h) : (I,h) \text{ is an } \varepsilon\text{-cube}\}|.$$


Proof.  The  proof  is  identical  to  that  of  Lemma  7.22.  □ 

It only remains to construct an $\varepsilon$-splitting tree. While this was trivial in the case of sets, it is here that the difficulties arise in the general setting.

Let us recall in more detail how we constructed a splitting tree for a class $\mathcal{F}$ of indicator functions of sets. Let $\mathcal{A} = \{1_C : C \in \mathcal{C}\}$ be a class of indicators. Note that for indicator functions, $1_C \ne 1_D$ necessarily implies that $1_C(x) = 0$ and $1_D(x) = 1$, or vice versa, for some $x \in X$. Therefore, as long as $\mathcal{A}$ is not a singleton, we can partition $\mathcal{A}$ into two nonempty sets $\mathcal{A}_+ = \{1_C \in \mathcal{A} : 1_C(x) = 1\}$ and $\mathcal{A}_- = \{1_C \in \mathcal{A} : 1_C(x) = 0\}$. We can now repeatedly apply this construction, starting at the root $\mathcal{F}$, until all of the leaves of the resulting tree are singletons. The key point of this construction is that nothing was lost in the process, so the leaves of the tree must form a partition of $\mathcal{F}$. But each leaf is a singleton, so there are $|\mathcal{F}|$ leaves.

Let us now attempt to apply the same idea to a general class of functions $\mathcal{F}$. Consider a set of functions $\mathcal{A} \subseteq \mathcal{F}$ that is not a singleton. Unlike in the case of indicators, $f \ne g$ does not imply that $f(x) \le h$ and $g(x) \ge h + \varepsilon$, or vice versa, for some $x \in X$ and $h \in \mathbb{R}$, as is needed for the construction of the children of $\mathcal{A}$. Thus we must assume some form of separation between the elements of $\mathcal{A}$. The minimal assumption we could impose is that $\mathcal{A}$ is an $\varepsilon$-packing of $(\mathcal{F}, \|\cdot\|_\infty)$: this would ensure that $\|f - g\|_\infty \ge \varepsilon$, and thus the above conclusion would follow. Therefore, if we introduce this assumption, then both $\mathcal{A}_- = \{f \in \mathcal{A} : f(x) \le h\}$ and $\mathcal{A}_+ = \{f \in \mathcal{A} : f(x) \ge h + \varepsilon\}$ are nonempty and satisfy the definition of an $\varepsilon$-splitting tree. However, $\mathcal{A}_+$ and $\mathcal{A}_-$ no longer form a partition of $\mathcal{A}$: it is very likely that some of the functions in $\mathcal{A}$ happen to take values in the “forbidden” region $(h, h + \varepsilon)$, and these functions must be thrown out in the construction of the tree. The key problem that we face is that we do not know how many functions we throw out, and thus we have no control over the number of leaves in the tree.

To surmount this problem, it is essential to find a coordinate $x$ and level $h$ at which we can split the set $\mathcal{A}$ without discarding too many functions. This is precisely the content of the following result. The price we pay is that the assumption that $\mathcal{A}$ is a packing in $(\mathcal{F}, \|\cdot\|_\infty)$ is too weak to make this happen: we need the stronger assumption that $\mathcal{A}$ is a packing in $(\mathcal{F}, \|\cdot\|_{L^2(\mu)})$.

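Proposition 7.36 (Controlled splitting). Let $\mathcal{F}$ be a class of functions and $\mu$ be a probability on $X$. Let $\mathcal{A}$ be a $c\varepsilon$-packing of $(\mathcal{F}, \|\cdot\|_{L^2(\mu)})$ with $|\mathcal{A}| \ge 2$. Then there exist $x \in X$ and $h \in \varepsilon\mathbb{Z}$ such that the sets

$$\mathcal{A}_- = \{f \in \mathcal{A} : f(x) \le h\}, \qquad \mathcal{A}_+ = \{f \in \mathcal{A} : f(x) \ge h + \varepsilon\}$$

satisfy $|\mathcal{A}_+|^{1/2} + |\mathcal{A}_-|^{1/2} > |\mathcal{A}|^{1/2}$.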

Proof. The idea is quite simple. Let us choose two random elements $a, a' \in \mathcal{A}$ drawn uniformly and independently. By assumption $\|a - a'\|_{L^2(\mu)} \ge c\varepsilon$ as long as $a \ne a'$, which happens with probability $1 - \frac{1}{|\mathcal{A}|} \ge \frac{1}{2}$. Thus

$$\frac{c^2\varepsilon^2}{2} \le \Big(1 - \frac{1}{|\mathcal{A}|}\Big) c^2\varepsilon^2 \le \mathbf{E}\|a - a'\|_{L^2(\mu)}^2 = \int \mathbf{E}|a(x) - a'(x)|^2\, \mu(dx).$$

Thus we can certainly choose $x \in X$ such that

$$\frac{c^2\varepsilon^2}{2} \le \mathbf{E}|a(x) - a'(x)|^2 = 2\,\mathrm{Var}[a(x)].$$

We now want to find $h \in \varepsilon\mathbb{Z}$ such that

$$\mathbf{P}[a(x) \le h]^{1/2} + \mathbf{P}[a(x) \ge h + \varepsilon]^{1/2} > 1.$$

Indeed, as we have $\mathbf{P}[a(x) \le h] = |\mathcal{A}_-|/|\mathcal{A}|$ and $\mathbf{P}[a(x) \ge h + \varepsilon] = |\mathcal{A}_+|/|\mathcal{A}|$, the proof would evidently be complete once we can find such an $h$.

At this point, it seems the proof should reduce to a general probabilistic principle: if $\mathrm{Var}[X] \ge C^2\varepsilon^2$ for $C \gg 1$, then it should not be possible that most of the probability mass of $X$ is concentrated in an interval of size $\lesssim \varepsilon$. This is precisely the statement of the following result, to be proved below.

Lemma 7.37. There is a universal constant $C$ such that if $\mathrm{Var}[X] \ge C^2\varepsilon^2$, then there exists $b \in \mathbb{R}$ such that $\mathbf{P}[X \le b]^{1/2} + \mathbf{P}[X \ge b + \varepsilon]^{1/2} > 1$.

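The only remaining issue is that Lemma 7.37 yields $b \in \mathbb{R}$, while we need $h \in \varepsilon\mathbb{Z}$. This is easily resolved, however. Choose the universal constant $c = 4C$. As we have $\mathrm{Var}[a(x)] \ge C^2(2\varepsilon)^2$, Lemma 7.37 yields $b \in \mathbb{R}$ such that

$$\mathbf{P}[a(x) \le b]^{1/2} + \mathbf{P}[a(x) \ge b + 2\varepsilon]^{1/2} > 1.$$

Now choose $h$ to be the value of $b$ rounded upwards to the nearest multiple of $\varepsilon$. Then $b \le h < b + \varepsilon$, so that $\mathbf{P}[a(x) \le h] \ge \mathbf{P}[a(x) \le b]$ and $\mathbf{P}[a(x) \ge h + \varepsilon] \ge \mathbf{P}[a(x) \ge b + 2\varepsilon]$, and the proof is readily completed. □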
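For a finite class on a finite set, the coordinate $x$ and level $h$ of Proposition 7.36 can simply be found by exhaustive search. The following is a minimal sketch under that assumption; the function name `best_split` and the representation of functions as tuples of values are choices made for this illustration, not part of the proof.

```python
import math

def best_split(functions, eps):
    """Search all coordinates x and levels h in eps*Z for the split maximizing
    |A_-|^(1/2) + |A_+|^(1/2), where
    A_- = {f : f(x) <= h} and A_+ = {f : f(x) >= h + eps}.

    `functions` is a list of equal-length tuples (the values of f on X).
    Returns (score, x, h, A_minus, A_plus).
    """
    best = None
    for x in range(len(functions[0])):
        values = [f[x] for f in functions]
        # Levels outside [min - eps, max + eps] give the same trivial splits.
        k_lo = math.floor(min(values) / eps) - 1
        k_hi = math.ceil(max(values) / eps) + 1
        for k in range(k_lo, k_hi + 1):
            h = k * eps
            minus = [f for f in functions if f[x] <= h]
            plus = [f for f in functions if f[x] >= h + eps]
            score = math.sqrt(len(minus)) + math.sqrt(len(plus))
            if best is None or score > best[0]:
                best = (score, x, h, minus, plus)
    return best

# Four well-separated functions on a two-point set.
A = [(0, 0), (0, 2), (2, 0), (2, 2)]
score, x, h, A_minus, A_plus = best_split(A, 1.0)
```

In this toy example no function falls in the forbidden region, so the split is a genuine partition into two halves.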

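It remains to prove the small deviation principle used above.

Proof (Lemma 7.37). We prove the contrapositive. Suppose the conclusion fails, that is, that $\mathbf{P}[X \le b]^{1/2} + \mathbf{P}[X \ge b + \varepsilon]^{1/2} \le 1$ for all $b \in \mathbb{R}$. Then

$$\mathbf{P}[X \ge b + \varepsilon] \le \mathbf{P}[X > b]^2, \qquad \mathbf{P}[X \le b] \le \mathbf{P}[X < b + \varepsilon]^2 \qquad \text{for all } b \in \mathbb{R},$$

where we used $\mathbf{P}[X \le b] \le \mathbf{P}[X \le b]^{1/2}$ ($\mathbf{P}[X \ge b + \varepsilon] \le \mathbf{P}[X \ge b + \varepsilon]^{1/2}$) in the first (second) inequality. Let $M = \mathrm{med}(X)$ be the median of $X$. Iterating these inequalities starting from $\mathbf{P}[X > M] \le \frac{1}{2}$ ($\mathbf{P}[X < M] \le \frac{1}{2}$) yields

$$\mathbf{P}[X > M + k\varepsilon] \le 2^{-2^k}, \qquad \mathbf{P}[X < M - k\varepsilon] \le 2^{-2^k} \qquad \text{for all } k \in \mathbb{N}.$$

Thus the random variable $X$ has very thin tail probabilities. But a random variable with thin tails certainly cannot have large variance: to be precise,

$$\mathrm{Var}[X] \le \mathbf{E}[(X - M)^2] = \sum_{k=0}^\infty \int_{k\varepsilon}^{(k+1)\varepsilon} 2t\, \mathbf{P}[|X - M| > t]\, dt < C^2\varepsilon^2$$

with $C^2 = \sum_{k=0}^\infty 4(k+1)2^{-2^k}$. Thus the contrapositive is proved. □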
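To get a feeling for the quantities in this proof, one can evaluate the doubly exponential series and test the splitting functional on small discrete distributions. This is only an illustrative sketch: the helper names are hypothetical, and the series is the one obtained by bounding $2t \le 2(k+1)\varepsilon$ and $\mathbf{P}[|X-M|>t] \le 2\cdot 2^{-2^k}$ on each interval $[k\varepsilon,(k+1)\varepsilon]$.

```python
import math

def tail_constant(terms=10):
    """Partial sum of sum_{k>=0} 4 (k+1) 2^(-2^k); the terms decay doubly exponentially."""
    return sum(4 * (k + 1) * 2.0 ** (-(2 ** k)) for k in range(terms))

def split_mass(values, probs, b, eps):
    """P[X <= b]^(1/2) + P[X >= b + eps]^(1/2) for a discrete random variable X."""
    left = sum(p for v, p in zip(values, probs) if v <= b)
    right = sum(p for v, p in zip(values, probs) if v >= b + eps)
    return math.sqrt(left) + math.sqrt(right)

C_squared = tail_constant()

# Large variance relative to eps: a good level b exists.
spread = split_mass([-3.0, 3.0], [0.5, 0.5], b=-3.0, eps=1.0)

# All mass inside an interval shorter than eps: no level b does better than 1.
grid = [i / 10.0 for i in range(-30, 31)]
concentrated = max(split_mass([0.0, 0.5], [0.5, 0.5], b, 1.0) for b in grid)
```

The second example shows why the strict inequality matters: a trivial split (all mass on one side) always achieves the value 1, so only values exceeding 1 carry information.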

With Proposition 7.36 in hand, we can now construct a large $\varepsilon$-splitting tree in a greedy fashion in the same manner as we did in the case of sets.

Corollary 7.38. Let $\mathcal{F}$ be a class of functions and $\mu$ be a probability on $X$. Let $S$ be a $c\varepsilon$-packing of $(\mathcal{F}, \|\cdot\|_{L^2(\mu)})$. There exists an $\varepsilon$-splitting tree $\mathbf{A}$ with

$$|\{\text{leaves of } \mathbf{A}\}| \ge |S|^{1/2}.$$

Proof. Grow the $\varepsilon$-splitting tree $\mathbf{A}$ by starting with $S$ as the root and repeatedly splitting the leaves of the tree into two subsets using Proposition 7.36 until all leaves are singletons. By construction, we have $|\mathcal{A}_+|^{1/2} + |\mathcal{A}_-|^{1/2} \ge |\mathcal{A}|^{1/2}$ for every $\mathcal{A} \in \mathbf{A}$ that is not a leaf. Iterating this bound starting at the root gives

$$|S|^{1/2} \le \sum_{\mathcal{A} \text{ is a leaf}} |\mathcal{A}|^{1/2} = |\{\text{leaves in } \mathbf{A}\}|,$$

and the proof is complete. □
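The greedy construction in this proof can be sketched in a few lines for a finite class on a finite set. The code below is an illustration only: it picks splits by exhaustive search rather than through Proposition 7.36, it leaves a node as a leaf when no split has two nonempty children, and it therefore guarantees the bound $|\{\text{leaves}\}| \ge |S|^{1/2}$ only when the packing assumption of the corollary actually holds. All names are hypothetical.

```python
import math
from itertools import product

def find_split(node, eps):
    """Return (A_minus, A_plus) maximizing |A_-|^(1/2) + |A_+|^(1/2) over all
    coordinates x and levels h in eps*Z with both children nonempty, or None."""
    best, best_score = None, -1.0
    for x in range(len(node[0])):
        values = [f[x] for f in node]
        for k in range(math.floor(min(values) / eps), math.ceil(max(values) / eps) + 1):
            h = k * eps
            minus = [f for f in node if f[x] <= h]
            plus = [f for f in node if f[x] >= h + eps]
            if minus and plus:
                score = math.sqrt(len(minus)) + math.sqrt(len(plus))
                if score > best_score:
                    best, best_score = (minus, plus), score
    return best

def count_leaves(node, eps):
    """Number of leaves of the greedily grown eps-splitting tree rooted at `node`."""
    if len(node) <= 1:
        return 1
    split = find_split(node, eps)
    if split is None:  # no admissible split: this node stays a leaf
        return 1
    minus, plus = split
    return count_leaves(minus, eps) + count_leaves(plus, eps)

# All sign vectors on a three-point set: every split is a perfect partition.
S = list(product([-1, 1], repeat=3))
leaves = count_leaves(S, 1.0)
```

For the sign vectors the tree is complete, which is the extreme case where nothing is discarded; for general classes the number of leaves is typically much smaller than $|S|$.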

Combining  Lemma  7.35  and  Corollary  7.38  yields  Theorem  7.32. 

Remark 7.39. There is nothing special about the power $|S|^{1/2}$ in Theorem 7.32: the statement remains valid if $|S|^{1/2}$ is replaced by $|S|^{1-\alpha}$ for any $0 < \alpha < 1$ at the expense of changing the value of the universal constant $c$. To see this, note that the origin of the power $\frac{1}{2}$ is in Lemma 7.37, where the precise value of the power is however entirely irrelevant in the proof. We have stated the above results in terms of $|S|^{1/2}$ merely to avoid notational distractions (the value of the power ultimately affects only the constants in Theorem 7.30).


Problems 

7.12 (VC-subgraph classes and pseudodimension). There is a simple method to extend the bound of Theorem 7.16 for classes of sets to classes of functions without introducing the notion of combinatorial dimension. Given a class of functions $\mathcal{F}$ on a set $X$, define an associated class of sets $\mathcal{C}_{\mathcal{F}}$ as

$$\mathcal{C}_{\mathcal{F}} := \{C \subseteq X \times \mathbb{R} : C = \{(x,t) : t < f(x)\},\ f \in \mathcal{F}\}.$$

That is, $\mathcal{C}_{\mathcal{F}}$ is the class of subgraphs of functions in $\mathcal{F}$. We now define the pseudodimension $\mathrm{vc}(\mathcal{F})$ as the VC-dimension $\mathrm{vc}(\mathcal{C}_{\mathcal{F}})$ of the subgraphs.

a. Deduce directly from Theorem 7.16 that if $\mathcal{F}$ is a class of functions such that $\|f\|_\infty \le 1$ for all $f \in \mathcal{F}$, then there is a universal constant $K$ such that

$$\sup_\mu N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \le \Big(\frac{K}{\varepsilon}\Big)^{K\,\mathrm{vc}(\mathcal{F})} \qquad \text{for all } \varepsilon < 1.$$

Hint: consider $(\mathcal{C}_{\mathcal{F}}, \|\cdot\|_{L^2(\mu \otimes \lambda)})$ with $\lambda$ the uniform distribution on $[-1,1]$.

b. Show that the linear class $\mathcal{F}$ in Example 7.28 satisfies $\mathrm{vc}(\mathcal{F}) < \infty$, but that the bounded variation class in Example 7.29 satisfies $\mathrm{vc}(\mathcal{F}) = \infty$.

At first sight, pseudodimension and combinatorial dimension seem to yield two distinct methods to bound the uniform covering numbers of function classes. However, this is not the case: the result of part a. is none other than a special case of Theorem 7.30 for classes of finite metric dimension.




c. Show that $\mathrm{vc}(\mathcal{F}) = \sup_{\varepsilon > 0} \mathrm{vc}(\mathcal{F}, \varepsilon)$, and conclude that the result of part a. follows as a special case of Theorem 7.30.

7.13 (Combinatorial dimension of convex classes). The notion of combinatorial dimension is designed to be meaningful for any class of functions $\mathcal{F}$. If we assume that the class $\mathcal{F}$ is convex, however, the combinatorial dimension can be given a simple geometric interpretation: $\mathrm{vc}(\mathcal{F}, \varepsilon)$ is the largest dimension of a cube of side length $\varepsilon$ that is contained in a coordinate projection of $\mathcal{F}$.

a. Suppose that $\mathcal{F}$ is convex. Show that

$$(I,h) \text{ is } \varepsilon\text{-shattered if and only if } \mathcal{F}|_I \supseteq h + [0,\varepsilon]^I.$$

Hint: assume the conclusion is false; use the separating hyperplane theorem and reason as in Example 7.28 to generate a contradiction.

b. Suppose that $\mathcal{F}$ is convex and symmetric. Show that

$$I \text{ is } \varepsilon\text{-shattered if and only if } \mathcal{F}|_I \supseteq [-\tfrac{\varepsilon}{2}, \tfrac{\varepsilon}{2}]^I.$$

Hint: reason as in Example 7.29.

If $\mathcal{F}$ is not convex, one might expect that $(I,h)$ is $\varepsilon$-shattered if and only if the convex hull of $\mathcal{F}$ contains a cube $\mathrm{conv}\,\mathcal{F}|_I \supseteq h + [0,\varepsilon]^I$. This is not true, however: $\mathrm{conv}\,\mathcal{F}$ can have many more shattered sets than $\mathcal{F}$ itself.

c. Let $\mathcal{F} = \{1_{\{i\}} : i \in \mathbb{N}\}$ be a class of indicator functions on $\mathbb{N}$. Show that $\mathrm{vc}(\mathcal{F}, \varepsilon) = 1$ for all $\varepsilon \le 1$, but that $\mathrm{vc}(\mathrm{conv}\,\mathcal{F}, \varepsilon)$ diverges as $\varepsilon \downarrow 0$. Thus the convex hull of a finite-dimensional class can even be infinite-dimensional.

This example raises a basic question: when $\mathcal{F}$ is not convex, what can be said about the combinatorial dimension of the convex hull $\mathrm{vc}(\mathrm{conv}\,\mathcal{F}, \varepsilon)$ in terms of $\mathrm{vc}(\mathcal{F}, \varepsilon)$? Surprisingly, Theorem 7.30 can help us address this question.

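d. If $\{x_1,\ldots,x_n\} \subseteq X$ is $\varepsilon$-shattered by $\mathcal{F}$ and $g_1,\ldots,g_n \sim$ i.i.d. $N(0,1)$, prove

$$\mathcal{G}(\mathcal{F}) := \mathbf{E}\Big[\sup_{f \in \mathcal{F}} \sum_{i=1}^n g_i f(x_i)\Big] \gtrsim \varepsilon n.$$

Hint: replace $f(x_i)$ by $f(x_i) - h_i - \frac{\varepsilon}{2}$ in the definition of $\mathcal{G}(\mathcal{F})$, and choose the functions $f$ to cancel the signs of the Gaussian variables $g_i$.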


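e. Suppose that $\|f\|_\infty \le 1$ for all $f \in \mathcal{F}$. Show that for any $\delta > 0$

$$\mathcal{G}(\mathcal{F}) \lesssim n\delta + \sqrt{n} \int_\delta^1 \sqrt{K\,\mathrm{vc}(\mathcal{F}, t/K)\log(K/t)}\, dt \lesssim n\delta + \sqrt{n\,\mathrm{vc}(\mathcal{F}, \delta/K)}.$$

Hint: recall Theorem 5.31.

f. Let $\mathcal{F}$ be a class of functions such that $\|f\|_\infty \le 1$ for all $f \in \mathcal{F}$. Show that

$$\mathrm{vc}(\mathrm{conv}\,\mathcal{F}, L\varepsilon) \le \frac{\mathrm{vc}(\mathcal{F}, \varepsilon)}{\varepsilon^2} \qquad \text{for all } \varepsilon > 0,$$

where $L$ is a universal constant.

Hint: show that $\mathcal{G}(\mathcal{F}) = \mathcal{G}(\mathrm{conv}\,\mathcal{F})$ and combine the previous two parts.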

7.14 (Elton’s theorem). The notion of combinatorial dimension has its origin not in probability theory but in geometric functional analysis. Let us use the machinery we have developed to prove a classic result in this direction.

Let $(B, \|\cdot\|_B)$ be a Banach space. We are interested in the question whether the finite-dimensional Banach space $\ell_1^n$ embeds into $B$: that is, whether one can find vectors $x_1,\ldots,x_n \in B$ whose linear span is isomorphic to $\ell_1^n$ in the sense that there exist constants $C_1, C_2$ (independent of $n$) such that

$$C_1 \sum_{i=1}^n |a_i| \le \Big\|\sum_{i=1}^n a_i x_i\Big\|_B \le C_2 \sum_{i=1}^n |a_i| \qquad \text{for all } a \in \mathbb{R}^n.$$

The upper bound is trivial: if we choose any $x_1,\ldots,x_n$ in the unit ball of $B$ (i.e., $\|x_i\|_B \le 1$) then the upper bound holds for $C_2 = 1$ by the triangle inequality. The difficulty is to understand what spaces $B$ admit a lower bound. If the lower bound holds, then we obtain as a special case that

$$\|{\pm x_1} \pm \cdots \pm x_n\|_B \ge C_1 n$$

for all possible choices of signs; when this is the case, we say that $\ell_1^n$ sign-embeds into $B$. The converse is far from clear, however: if $\ell_1^n$ sign-embeds into $B$, does this already imply a full embedding as defined above?

Elton’s theorem provides an answer to this question. In fact, Elton only makes the weaker assumption that the sign-embedding holds “on average” in the sense that there exist $x_1,\ldots,x_n$ in the unit ball of $B$ and $\delta > 0$ such that

$$\mathbf{E}\Big\|\sum_{i=1}^n \varepsilon_i x_i\Big\|_B \ge \delta n,$$

where $\varepsilon_1,\ldots,\varepsilon_n$ are i.i.d. symmetric Bernoulli variables (random signs). Under this assumption, we will prove the following quantitative form of Elton’s theorem: there is a subset $I \subseteq \{1,\ldots,n\}$ of cardinality $|I| \ge c\delta^2 n$ such that

$$c\delta \sum_{i \in I} |a_i| \le \Big\|\sum_{i \in I} a_i x_i\Big\|_B \le \sum_{i \in I} |a_i| \qquad \text{for all } a \in \mathbb{R}^n,$$

where $c$ is a universal constant. Thus the existence of a random sign-embedding of $\ell_1^n$ with dimension $n$ and constant $\delta$ implies the existence of an embedding of $\ell_1^{n'}$ with dimension $n' \gtrsim \delta^2 n$ and constant $\gtrsim \delta$.

a. Let $B_1^*$ be the unit ball in the dual space of $B$, and define

$$\mathcal{F} = \{f : \{x_1,\ldots,x_n\} \to \mathbb{R} : f(x) = \langle y, x\rangle,\ y \in B_1^*\}.$$

Show that $\{x_i : i \in I\}$ is $2\varepsilon$-shattered by $\mathcal{F}$ if and only if

$$\Big\|\sum_{i \in I} a_i x_i\Big\|_B = \sup_{f \in \mathcal{F}} \sum_{i \in I} a_i f(x_i) \ge \varepsilon \sum_{i \in I} |a_i| \qquad \text{for all } a \in \mathbb{R}^n.$$

Hint: use the ideas from the first part of Problem 7.13.

b.  Show  that  for  all  e  >  0 

n 

E  ^  ^  &iXi 

i= 1 

Hint:  argue  as  in  the  second  part  of  Problem  7.13. 

c.  Complete  the  proof  of  Elton’s  theorem  in  the  form  stated  above. 

7.4  The  iteration  method 

We  have  developed  in  the  previous  section  a  powerful  combinatorial  bound  on 
the  uniform  covering  numbers  of  classes  of  functions.  This  bound  suffices  in 
many  cases  to  obtain  distribution-free  control  of  the  supremum  of  empirical 
processes.  It  is  of  significant  interest,  however,  to  understand  how  sharp  such 
bounds  are  in  general:  does  combinatorial  dimension  capture  completely  the 
size  of  the  uniform  covering  numbers?  To  gain  some  insight  into  this  question, 
let  us  begin  by  developing  a  simple  lower  bound. 

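Lemma 7.40. Let $\mathcal{F}$ be a class of functions on $X$ that is uniformly bounded with $\sup_{f \in \mathcal{F}} \|f\|_\infty \le 1$. Then for universal constants $C, c$ and all $\varepsilon > 0$

$$\tfrac{1}{8}\,\mathrm{vc}(\mathcal{F}, 4\varepsilon) \le \log \sup_\mu N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \le C\,\mathrm{vc}(\mathcal{F}, c\varepsilon) \log\Big(\frac{C}{\varepsilon}\Big).$$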

Proof. The upper bound is Theorem 7.30. To prove the lower bound, let $(I,h)$ be a $4\varepsilon$-shattered pair with $|I| = \mathrm{vc}(\mathcal{F}, 4\varepsilon)$, and let $\mu$ be the uniform distribution on $I$. The proof follows once we show $\log N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \ge |I|/8$.

To establish this claim, choose for every $J \subseteq I$ a function $f_J \in \mathcal{F}$ such that $f_J(x) \le h(x)$ for $x \in J$ and $f_J(x) \ge h(x) + 4\varepsilon$ for $x \in I \setminus J$. Then $\|f_J - f_{J'}\|_{L^2(\mu)} \ge 4\varepsilon \sqrt{|I|^{-1}|J \triangle J'|}$ for every $J, J' \subseteq I$. By Lemma 7.24, there exists a family $\mathcal{J}$ of subsets of $I$ with $|\mathcal{J}| \ge e^{|I|/8}$ such that $|J \triangle J'| \ge |I|/4$ for every $J, J' \in \mathcal{J}$, $J \ne J'$. Then $\{f_J : J \in \mathcal{J}\}$ is evidently a $2\varepsilon$-packing of $\mathcal{F}$, and the claim follows by the duality between packing and covering. □

Lemma 7.40 suggests that our combinatorial bounds are not far from being sharp: up to universal constants, the lower and upper bounds in Lemma 7.40 differ only by a logarithmic factor $\sim \log(1/\varepsilon)$. The immediate question that arises at this point is whether we can close the gap between the upper and lower bounds: perhaps an improved upper bound can eliminate the logarithmic factor, or perhaps an improved lower bound can add an additional logarithmic factor? Unfortunately, no improvement of this kind is possible: the logarithmic factor is sharp for some classes $\mathcal{F}$ but not for others.

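Example 7.41. Let $X = \mathbb{N}$ and $\mathcal{F} = \{1_{\{i\}} : i \in \mathbb{N}\}$. Then $\mathrm{vc}(\mathcal{F}, \varepsilon) = 1$ for all $0 < \varepsilon \le 1$. On the other hand, if $\mu$ is the uniform distribution on $\mathbb{N} \cap [1, 1/8\varepsilon^2]$, then we have $\|1_{\{i\}} - 1_{\{j\}}\|_{L^2(\mu)} \ge 2\varepsilon$ for all $i, j \le \lfloor 1/8\varepsilon^2 \rfloor$, $i \ne j$, which implies $\log N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \gtrsim \log(1/\varepsilon)$ by duality of packing and covering. Thus in this case the logarithmic factor in the upper bound of Lemma 7.40 is sharp.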

Example 7.42. Let $X = [0,1]$ and $\mathcal{F} = \{f \in \mathrm{Lip}(X) : 0 \le f \le 1\}$. It is easily shown as in Example 7.29 that $\mathrm{vc}(\mathcal{F}, \varepsilon) = 1 + \lfloor 1/\varepsilon \rfloor$ for all $0 < \varepsilon < 1$ (the upper bound follows immediately from Example 7.29; for the lower bound, repeating the proof in Example 7.29 with piecewise linear functions $f_J$ shows that $I = \{k\varepsilon : 0 \le k \le \lfloor 1/\varepsilon \rfloor\}$ is $\varepsilon$-shattered). On the other hand, we have proved in Lemma 5.16 that $\log N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \lesssim 1/\varepsilon$ for every probability measure $\mu$. Thus in this case the lower bound in Lemma 7.40 is sharp, while the upper bound contains an unnecessary logarithmic factor.

For what classes must the logarithmic factor appear, and when is it unnecessary? In the remainder of this section, we will develop a method that will make it possible in many cases to resolve the mystery of the logarithmic factor.
In  concrete  applications  this  will  often  not  yield  a  major  improvement:  the 
logarithmic  factor  tends  to  be  innocuous  except  in  borderline  cases.  Nonethe¬ 
less,  a  better  understanding  of  uniform  covering  bounds  can  lead  to  sharper 
results  in  certain  problems,  and  deepens  our  fundamental  understanding  of 
the  connections  between  covering  numbers  and  combinatorial  dimension.  More 
importantly,  the  iteration  method  that  we  will  develop  for  this  purpose  is  of 
significant  interest  in  its  own  right,  and  can  be  used  to  great  effect  in  many 
other  problems  (see,  for  example,  Problem  7.17  below). 

In order to understand how one might eliminate the logarithmic factor, let us begin with an elementary observation. While this might not be entirely obvious at first sight, the bound of Theorem 7.30 depends on two distinct scales: on the one hand, we are covering the class $\mathcal{F}$ by balls of radius $\varepsilon$; on the other hand, we have assumed that the class $\mathcal{F}$ is itself uniformly bounded by $\sup_{f \in \mathcal{F}} \|f\|_\infty \le 1$. If we were to assume instead that $\sup_{f \in \mathcal{F}} \|f\|_\infty \le a$, then applying Theorem 7.30 to the scaled class $\mathcal{F}/a$ readily yields

$$\log N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \le C\,\mathrm{vc}(\mathcal{F}, c\varepsilon) \log\Big(\frac{Ca}{\varepsilon}\Big)$$

for every $\varepsilon > 0$ and every probability measure $\mu$. Thus the logarithmic factor in Lemma 7.40 does not depend on $\varepsilon$, but rather on the ratio $\varepsilon/a$ between the scale of the cover and the size of the class $\mathcal{F}$. The logarithmic factor would disappear entirely if $a \lesssim \varepsilon$, but this is not adequate: the size of the class is fixed, while we are interested in the behavior of the covering numbers as $\varepsilon \downarrow 0$. Nonetheless, we will be able to exploit the fact that we have better covering number bounds for classes with controlled size to systematically improve our covering number bounds for arbitrary classes. This is the idea behind the iteration method, which we develop presently in a general setting.

Let $(T,d)$ be a metric space, and suppose that we can bound the covering number of any ball $B(t, 2\varepsilon)$ of radius $2\varepsilon$ by balls of radius $\varepsilon$ as follows:

$$\log N(T \cap B(t, 2\varepsilon), d, \varepsilon) \le \varphi(\varepsilon).$$

We would like to obtain a bound on the covering number $N(T,d,\varepsilon)$ of the entire set $T$. To this end, let us first cover $T$ by $N(T,d,2\varepsilon)$ balls of radius $2\varepsilon$, and then cover each of these balls by balls of radius $\varepsilon$. Then evidently the union of the latter balls is a cover of $T$ by balls of radius $\varepsilon$, and there are at most $e^{\varphi(\varepsilon)} N(T,d,2\varepsilon)$ such balls. We have therefore shown that

$$\log N(T,d,\varepsilon) \le \varphi(\varepsilon) + \log N(T,d,2\varepsilon).$$

We can now iterate this bound to obtain

$$\log N(T,d,\varepsilon) \le \sum_{k=0}^\infty \varphi(2^k\varepsilon)$$

(note that if $T$ has finite diameter, then $\log N(T,d,2^k\varepsilon) = 0$ for $k$ sufficiently large and the remainder term in the iteration vanishes; while if $T$ has infinite diameter, then $\varphi(\varepsilon) \ge \log 2$ for all $\varepsilon > 0$ and the inequality holds trivially).
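The iterated bound is easy to evaluate for model choices of $\varphi$. The following minimal sketch (the function name and the two model functions are assumptions made for illustration) sums $\varphi(2^k\varepsilon)$ over the scales below the diameter of $T$, at which point a single ball covers $T$ and the remainder term vanishes.

```python
def iterated_bound(phi, eps, diameter):
    """Sum phi(2^k * eps) over all k >= 0 with 2^k * eps < diameter.

    Once 2^k * eps >= diameter, one ball of that radius covers T, so
    log N(T, d, 2^k * eps) = 0 and the iteration stops.
    """
    total, scale = 0.0, eps
    while scale < diameter:
        total += phi(scale)
        scale *= 2.0
    return total

eps = 2.0 ** -10

# phi(s) = 1/s: the sum is a geometric series dominated by its first term.
polynomial = iterated_bound(lambda s: 1.0 / s, eps, diameter=1.0)

# phi(s) = 1: every scale contributes equally, giving about log2(1/eps) terms.
constant = iterated_bound(lambda s: 1.0, eps, diameter=1.0)
```

When $\varphi$ decays polynomially the total is at most $2\varphi(\varepsilon)$, while for constant $\varphi$ the total grows like $\log(1/\varepsilon)$; this is the dichotomy between the two examples above.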

Despite its simplicity, this procedure already explains the difference between Examples 7.41 and 7.42. Let us assume for the moment that we can apply the above iteration method with $\varphi(\varepsilon) \lesssim \mathrm{vc}(\mathcal{F}, c\varepsilon)$ (this is not entirely obvious at this point, but this idea will be made precise in the remainder of this section). In Example 7.42, we have $\varphi(\varepsilon) \lesssim 1/\varepsilon$, so

$$\log N(T,d,\varepsilon) \lesssim \frac{1}{\varepsilon} \sum_{k=0}^\infty 2^{-k} \lesssim \frac{1}{\varepsilon}.$$

Thus we have eliminated the logarithmic term in Lemma 7.40! On the other hand, in Example 7.41 we have $\varphi(\varepsilon) \lesssim 1$ and $\mathrm{vc}(\mathcal{F}, \varepsilon) = 0$ for $\varepsilon > 1$, so that

$$\log N(T,d,\varepsilon) \lesssim \sum_{k=0}^{\log(1/c\varepsilon)} \varphi(2^k\varepsilon) \lesssim \log\Big(\frac{1}{\varepsilon}\Big).$$

Thus in this case the logarithmic term in Lemma 7.40 remains in place. This computation explains much of the mystery of the logarithmic term: the lower bound in Lemma 7.40 is sharp for infinite-dimensional classes for which the combinatorial dimension $\mathrm{vc}(\mathcal{F}, \varepsilon)$ is at least polynomial in $1/\varepsilon$, while the upper bound is sharp for finite-dimensional classes when $\mathrm{vc}(\mathcal{F}, \varepsilon)$ is constant.

Remark 7.43. The iteration method should be understood as the direct analogue for covering numbers of the chaining method. In the chaining method, we aim to obtain a bound on the supremum of a general random process on $T$ starting from such a bound for the special case where the cardinality $|T|$ is controlled. To this end, we approximate the supremum of a general process by the supremum over a finite set plus a remainder term that is of the same form as the original supremum, and iterate this bound until the remainder term is eliminated. In a completely analogous manner, the iteration method allows us to obtain a bound on the covering number of the set $T$ starting from such a bound for the special case where the diameter of $T$ is controlled. Even if we can directly estimate $N(T,d,\varepsilon)$ as in Lemma 7.40, iteration systematically improves this bound by exploiting the control on the diameter at each scale.

The above discussion contains the key idea that will be developed in the sequel. Unfortunately, we cannot immediately apply the above computation to obtain bounds in terms of combinatorial dimension. In order to apply the simple iteration method developed above, we would require that

$$\log N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \le C\,\mathrm{vc}(\mathcal{F}, c\varepsilon) \log\Big(\frac{Ca}{\varepsilon}\Big)$$

for all $\varepsilon > 0$ whenever $\sup_{f \in \mathcal{F}} \|f\|_{L^2(\mu)} \le a$. However, we have only proved such a bound when $\sup_{f \in \mathcal{F}} \|f\|_\infty \le a$, which does not suffice. Indeed, using the latter bound, the first step of the iteration method would yield

$$\log N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \le C \log(2C)\,\mathrm{vc}(\mathcal{F}, c\varepsilon) + \log N(\mathcal{F}, \|\cdot\|_\infty, 2\varepsilon),$$

but then no control of the remainder term is possible as the $L^\infty$-covering numbers are generally infinite (as is the case, for example, for classes of sets). On the other hand, we did not use the uniform bound $\sup_{f \in \mathcal{F}} \|f\|_\infty \le a$ in the proof of Theorem 7.30 in a very sharp manner, so that one might hope that an improvement of the proof would show that the conclusion of Theorem 7.30 remains valid under the assumption $\sup_{f \in \mathcal{F}} \|f\|_{L^2(\mu)} \le a$. Unfortunately, this also cannot be the case, as the following simple example demonstrates.

Example 7.44. Let $X = [0,1]$ and let $\mu$ be the uniform distribution on $X$. Let

$$\mathcal{F}_\varepsilon = \{1_{[a,b]} : \|1_{[a,b]}\|_{L^2(\mu)} \le \varepsilon\}.$$

It is a trivial exercise to show that $\mathrm{vc}(\mathcal{F}_\varepsilon, \varepsilon) = 2$ for all $0 < \varepsilon < 1$.

On the other hand, let $C_k = [(k-1)\varepsilon^2, k\varepsilon^2]$. As $\|1_{C_k}\|_{L^2(\mu)} = \varepsilon$ and $\|1_{C_k} - 1_{C_l}\|_{L^2(\mu)} = 2^{1/2}\varepsilon$ for all $1 \le k, l \le \lfloor \varepsilon^{-2} \rfloor$, $k \ne l$, we can estimate

$$N(\mathcal{F}_\varepsilon, \|\cdot\|_{L^2(\mu)}, 2^{-1/2}\varepsilon) \ge \lfloor \varepsilon^{-2} \rfloor$$

by the duality of covering and packing. Thus it is not possible to replace the assumption $\sup_f \|f\|_\infty \le 1$ by $\sup_f \|f\|_{L^2(\mu)} \le 1$ in Theorem 7.30, as that would imply that $N(\mathcal{F}_\varepsilon, \|\cdot\|_{L^2(\mu)}, 2^{-1/2}\varepsilon)$ can be bounded uniformly in $\varepsilon$.

Despite this discouraging example, things are not quite as bad as they seem. While it is not possible to replace $\sup_f \|f\|_\infty \le 1$ by $\sup_f \|f\|_{L^2(\mu)} \le 1$ in Theorem 7.30, we will show that a significant improvement is still possible: it suffices to assume $\sup_f \|f\|_{L^p(\mu)} \le 1$ for any $p > 2$! In fact, we will prove a more general result that is essential for implementing the iteration method.

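Theorem 7.45 (Rudelson-Vershynin). Let $\mathcal{F}$ be a class of functions on $X$ and let $p \ge 2$. Suppose that $\sup_{f \in \mathcal{F}} \|f\|_{L^{2p}(\mu)} \le a$ for some probability $\mu$. Then

$$\log N(\mathcal{F}, \|\cdot\|_{L^p(\mu)}, \varepsilon) \le C p^2\, \mathrm{vc}(\mathcal{F}, c\varepsilon) \log\Big(\frac{a}{c\varepsilon}\Big) \qquad \text{for all } 0 < \varepsilon < a,$$

where $C, c$ are universal constants.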

Remark 7.46. There is nothing special about the bound $\sup_f \|f\|_{L^{2p}(\mu)} \le a$: the same proof will go through if $\sup_f \|f\|_{L^{\beta p}(\mu)} \le a$ for any $\beta > 1$, provided that we replace the constants $C, c$ by $C_\beta = C\beta/(\beta - 1)$ and $c_\beta = c(\beta - 1) \wedge c$, cf. Problem 7.15. As we will only need to apply this result for a fixed value of $\beta$, however, we have fixed $\beta = 2$ above for notational convenience.

Theorem 7.45 is all we need to apply the iteration method. The idea is exactly the same as in the simple iteration method discussed above: the only new feature is that we must use a different $L^p$-norm in every stage of the iteration in order to eliminate the logarithmic factor. Before we turn to the proof of Theorem 7.45, let us explore the consequences of this idea.

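Corollary 7.47 (Iteration). Let $\mathcal{F}$ be a class of functions on $X$. Then

$$\log \sup_\mu N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \le 4C \log(\alpha/c) \sum_{k=0}^\infty 4^k\, \mathrm{vc}(\mathcal{F}, c\alpha^k\varepsilon)$$

for any $\alpha > 1$, where $C, c$ are universal constants.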

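Proof. Fix a probability measure $\mu$, and let $p \ge 1$ and $\varepsilon > 0$. Define $B_p(f, \varepsilon) = \{g : \|g - f\|_{L^p(\mu)} \le \varepsilon\}$. By covering $\mathcal{F}$ by $L^{2p}$-balls of radius $\alpha\varepsilon$, and then covering each of these balls by $L^p$-balls of radius $\varepsilon$, we can estimate

$$N(\mathcal{F}, \|\cdot\|_{L^p(\mu)}, \varepsilon) \le \sup_{f \in \mathcal{F}} N(\mathcal{F} \cap B_{2p}(f, \alpha\varepsilon), \|\cdot\|_{L^p(\mu)}, \varepsilon)\; N(\mathcal{F}, \|\cdot\|_{L^{2p}(\mu)}, \alpha\varepsilon).$$

Applying Theorem 7.45 to $\{\mathcal{F} - f\} \cap B_{2p}(0, \alpha\varepsilon)$ yields

$$\log N(\mathcal{F}, \|\cdot\|_{L^p(\mu)}, \varepsilon) \le C \log(\alpha/c)\, p^2\, \mathrm{vc}(\mathcal{F}, c\varepsilon) + \log N(\mathcal{F}, \|\cdot\|_{L^{2p}(\mu)}, \alpha\varepsilon).$$

Iterating this bound starting at $p = 2$ readily yields the result, provided that the remainder term $\log N(\mathcal{F}, \|\cdot\|_{L^{2^{n+1}}(\mu)}, \alpha^n\varepsilon)$ vanishes as $n \to \infty$.

To see this, note that if $\sup_{f,g \in \mathcal{F}} \|f - g\|_\infty = \infty$, then $\mathrm{vc}(\mathcal{F}, \varepsilon) \ge 1$ for all $\varepsilon > 0$ and thus the iteration bound holds trivially. On the other hand, if $\sup_{f,g \in \mathcal{F}} \|f - g\|_\infty < \infty$, then $N(\mathcal{F}, \|\cdot\|_{L^{2^{n+1}}(\mu)}, \alpha^n\varepsilon) \le N(\mathcal{F}, \|\cdot\|_\infty, \alpha^n\varepsilon) = 1$ for all $n$ sufficiently large and thus the remainder term converges to zero. □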




Using  Corollary  7.47,  we  can  readily  understand  when  the  lower  bound  in 
Lemma  7.40  is  sharp:  this  is  always  the  case  for  classes  whose  combinatorial 
dimension  is  at  least  polynomial.  This  yields  a  sharp  bound,  up  to  universal 
constants,  for  most  infinite-dimensional  classes  of  practical  interest. 

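Corollary 7.48 (Infinite-dimensional classes). Let $\mathcal{F}$ be a class of functions on $X$. Suppose there is a function $\xi : \mathbb{R}_+ \to \mathbb{R}_+$ and $\alpha > 1$ such that

$$\mathrm{vc}(\mathcal{F}, \varepsilon) \le \xi(\varepsilon), \qquad \xi(\alpha\varepsilon) \le \xi(\varepsilon)/8 \qquad \text{for all } \varepsilon > 0.$$

Then

$$\log \sup_\mu N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \le 8C \log(\alpha/c)\, \xi(c\varepsilon) \qquad \text{for all } \varepsilon > 0.$$

In particular, if $\mathrm{vc}(\mathcal{F}, \varepsilon)$ is comparable to $\xi(\varepsilon)$ in the sense that

$$\xi(\varepsilon/K) \le \mathrm{vc}(\mathcal{F}, \varepsilon) \le \xi(\varepsilon) \qquad \text{for all } \varepsilon > 0$$

holds for some constant $K$, then

$$\mathrm{vc}(\mathcal{F}, 4\varepsilon) \lesssim \log \sup_\mu N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon) \lesssim \mathrm{vc}(\mathcal{F}, Kc\varepsilon) \qquad \text{for all } \varepsilon > 0.$$

Proof. The upper bound follows immediately from Corollary 7.47 and the property $\xi(\alpha^k\varepsilon) \le 8^{-k}\xi(\varepsilon)$. The lower bound follows from Lemma 7.40. □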
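The role of the factor $8$ in this corollary is that it beats the factor $4^k$ coming from the growth of $p^2$ along the iteration, leaving a convergent geometric series. The following sketch checks this numerically for the model function $\xi(\varepsilon) = \varepsilon^{-3}$ with $\alpha = 2$; both the function name and the model choice are assumptions made for illustration.

```python
def iteration_sum(xi, alpha, eps, terms=60):
    """Truncated sum over k of 4^k * xi(alpha^k * eps), the series appearing
    in the iteration bound up to the constant prefactor."""
    return sum(4.0 ** k * xi(alpha ** k * eps) for k in range(terms))

def xi(e):
    """Model dimension bound with xi(2 e) = xi(e) / 8."""
    return e ** -3

eps = 0.5
total = iteration_sum(xi, alpha=2.0, eps=eps)
```

Since each term is $4^k 8^{-k}\xi(\varepsilon) = 2^{-k}\xi(\varepsilon)$, the sum is at most $2\xi(\varepsilon)$, which together with the prefactor $4C\log(\alpha/c)$ accounts for the constant $8C\log(\alpha/c)$ in the statement.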


In applications to empirical processes, we are typically interested not in
$N(\mathcal{F}, \|\cdot\|_{L^2(\mu)}, \varepsilon)$ in its own right, but rather in the chaining bound that arises
from symmetrization. Applying Theorem 7.30 yields the upper bound


$$\int \sqrt{\mathrm{vc}(\mathcal{F}, \varepsilon) \log(1/\varepsilon)}\; d\varepsilon,$$


and we have seen that the logarithmic factor can be removed for most
infinite-dimensional classes. Surprisingly, however, the latter assumption is
not needed: the logarithmic factor can always be removed in the entropy
integral without any further assumptions! While this is a remarkable result,
it should not come as a great surprise: we have essentially already used the
iteration method in the proof of Theorem 6.19 in the same manner.


Corollary 7.49 (Entropy integral and combinatorial dimension). Let
$\mathcal{F}$ be a class of functions on $X$. Then we have


Proof. The lower bound follows immediately from Lemma 7.40. For the upper
bound, note that we have by Corollary 7.47 with $\alpha = 4$


Integrating  both  sides  and  a  simple  change  of  variables  yields  the  proof.  □ 

The remainder of this section is devoted to the proof of Theorem 7.45.
Somewhat surprisingly, the difficulty of the proof does not lie in the
combinatorial aspect of the problem, which is where most of our efforts were spent in
the previous sections: the combinatorial part of the proof follows essentially
along the same lines as in the proof of Theorem 7.30. As will become clear in
due course, the real difficulty of Theorem 7.45 is that the probabilistic
extraction principle provided by Lemma 7.17 is no longer adequate when we only
assume that the class is bounded in $L^p$ rather than in $L^\infty$.

Let  us  begin,  however,  with  the  combinatorial  part  of  the  proof.  Following 
the  proof  of  Theorem  7.30,  we  first  obtain  an  analogue  of  Theorem  7.32. 

Theorem 7.50. Let $\mathcal{F}$ be a class of functions and let $\mu$ be a probability on $X$.
Then for any $S \subseteq \mathcal{F}$ that is a $c\varepsilon$-packing of $(\mathcal{F}, \|\cdot\|_{L^p(\mu)})$ for $p \ge 2$, we have

$$|S|^{1/p} \le |\{(I,h) : (I,h) \text{ is an } \varepsilon\text{-cube}\}|.$$

Here $c$ is a universal constant.

The  proof  is  almost  identical  to  that  of  Theorem  7.32,  and  we  only  sketch 
the  necessary  changes.  We  first  extend  Lemma  7.37.  It  is  not  at  all  surprising 
that  this  is  possible:  we  left  a  lot  of  room  in  the  proof  of  Lemma  7.37. 

Lemma 7.51. There is a universal constant $C$ so that if $\mathbf{E}[|X - \mathrm{med}(X)|^p] \ge C^p\varepsilon^p$
for some $p \ge 2$, then $\mathbf{P}[X \le b]^{1/p} + \mathbf{P}[X \ge b + \varepsilon]^{1/p} \ge 1$ for some $b \in \mathbb{R}$.

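Proof. Suppose that the conclusion fails. Then it follows that

$$\mathbf{P}[|X - \mathrm{med}(X)| \ge k\varepsilon] \le 2^{1-pk} \qquad \text{for all } k \in \mathbb{N}$$

as in the proof of Lemma 7.37. Therefore

$$\mathbf{E}[|X - \mathrm{med}(X)|^p] = \sum_{k=0}^{\infty} \int_{k\varepsilon}^{(k+1)\varepsilon} p\,t^{p-1}\, \mathbf{P}[|X - \mathrm{med}(X)| \ge t]\, dt < C^p\varepsilon^p,$$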

where we used $\{2p \sum_{k=0}^{\infty} (k+1)^{p-1} 2^{-pk}\}^{1/p} \le 2e\{1 + \sum_{k=1}^{\infty} (k+1)^2 2^{-2k}\}^{1/2} =: C$
as $p \ge 2$. Thus we proved the contrapositive of the result. □
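
As a numerical sanity check of this last step (a sketch only, not part of the proof), one can evaluate the quantity $\{2p\sum_{k\ge 0}(k+1)^{p-1}2^{-pk}\}^{1/p}$ for several values of $p \ge 2$ and confirm that it stays below a bound that does not depend on $p$. The function name `lemma_constant`, the truncation of the series at 200 terms, and the particular reference bound $2\{\sum_{k\ge 0}(k+1)^2 4^{-k}\}^{1/2}$ are choices made here for illustration.

```python
import math

def lemma_constant(p, terms=200):
    """Evaluate {2p * sum_{k>=0} (k+1)^(p-1) 2^(-pk)}^(1/p), truncating the series."""
    total = 0.0
    for k in range(terms):
        # Work with logarithms so that large p does not overflow.
        total += math.exp((p - 1) * math.log(k + 1) - p * k * math.log(2))
    return (2 * p * total) ** (1 / p)

# Reference bound: (2p)^(1/p) <= 2 for p >= 2, and the l^p norm of the
# sequence (k+1) 2^(-k) is dominated by its l^2 norm.
reference = 2 * math.sqrt(sum((k + 1) ** 2 * 4.0 ** (-k) for k in range(200)))

values = {p: lemma_constant(p) for p in (2, 3, 5, 10, 50, 200)}
for p, v in values.items():
    print(p, round(v, 4), v <= reference)
```

For $p = 2$ the series can be summed in closed form, which gives a convenient check of the code against $8/3$.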

Proof (Theorem 7.50). We must only prove an analogue of Proposition 7.36:
the remainder of the proof is identical to that of Theorem 7.32.

To this end, let $A$ be a $c\varepsilon$-packing of $(\mathcal{F}, \|\cdot\|_{L^p(\mu)})$ with $|A| \ge 2$, and draw
random elements $a, a' \in A$ uniformly and independently. Then $a \ne a'$ with
probability $1 - 1/|A| \ge 1/2$, so that

$$\frac{(c\varepsilon)^p}{2} \le \mathbf{E}\|a - a'\|_{L^p(\mu)}^p = \int \mathbf{E}|a(x) - a'(x)|^p\, \mu(dx).$$

Thus there exists $x \in X$ such that

$$\frac{(c\varepsilon)^p}{2} \le \mathbf{E}|a(x) - a'(x)|^p \le 2^p\, \mathbf{E}|a(x) - \mathrm{med}(a(x))|^p,$$

where we used the triangle inequality $|a - a'| \le |a - \mathrm{med}(a)| + |a' - \mathrm{med}(a')|$
and convexity $(x + y)^p \le 2^{p-1}(x^p + y^p)$. We can now apply Lemma 7.51, and
the remainder of the proof is identical to that of Proposition 7.36. □


Next, we prove an analogue of Corollary 7.33 in the present setting. The
main difficulty here is that we no longer have boundedness of the class in $L^\infty$
but only in $L^{2p}$. At this stage, however, this is only a minimal inconvenience:
even boundedness in $L^1$ suffices, and the proof is an exercise in counting.


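Corollary 7.52. Let $\mathcal{F}$ be a class of functions on a finite set $X$, and let $\mu$ be
the uniform distribution on $X$. Suppose that $\|f\|_{L^1(\mu)} \le a$ for all $f \in \mathcal{F}$. Then
for any $p \ge 2$ and $c\varepsilon$-packing $S$ of $(\mathcal{F}, \|\cdot\|_{L^p(\mu)})$ with $\varepsilon \le a$, we have

$$|S|^{1/p} \le \left(\frac{4e^2 a |X|}{\varepsilon\, \mathrm{vc}(\mathcal{F}, \varepsilon)}\right)^{2\,\mathrm{vc}(\mathcal{F}, \varepsilon)}.$$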


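Proof. First, we claim that if $(I,h)$ is an $\varepsilon$-cube, then $\sum_{x\in I} |h(x)| \le a|X|$.

Indeed, as $(I,h)$ is $\varepsilon$-shattered, we can find $f \in \mathcal{F}$ such that $f(x) \le h(x)$ if
$h(x) \le 0$ and $f(x) \ge h(x) + \varepsilon$ if $h(x) > 0$. This implies, in particular, that
$|h(x)| \le |f(x)|$ for $x \in I$, and thus the claim follows from $\|f\|_{L^1(\mu)} \le a$.

Now note that given a fixed set $I \subseteq X$ with $|I| = k$, we have

$$|\{h \in (\varepsilon\mathbb{Z})^I : \textstyle\sum_{x\in I} |h(x)| \le a|X|\}| \le 2^k\, |\{m_1,\ldots,m_k \in \mathbb{Z}_+ : \textstyle\sum_{i=1}^k m_i \le a|X|/\varepsilon\}| = 2^k\, |\{m_1,\ldots,m_k \in \mathbb{N} : \textstyle\sum_{i=1}^k m_i \le a|X|/\varepsilon + k\}|.$$

As $r_i = \sum_{j \le i} m_j$ defines a one-to-one correspondence between sequences of
integers $m_1,\ldots,m_k \ge 1$ such that $\sum_i m_i \le N$ and increasing sequences of
integers $1 \le r_1 < \cdots < r_k \le N$ (of which there are $\binom{N}{k}$), we obtain

$$|\{h : (I,h) \text{ is an } \varepsilon\text{-cube}\}| \le 2^k \binom{\lfloor a|X|/\varepsilon \rfloor + k}{k} \le \left(\frac{4ea}{\varepsilon}\right)^k \binom{|X|}{k},$$

where we used $(\frac{n}{k})^k \le \binom{n}{k} \le (\frac{en}{k})^k$ in the second inequality. Therefore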


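$$|\{(I,h) : (I,h) \text{ is an } \varepsilon\text{-cube}\}| \le \sum_{k=0}^{\mathrm{vc}(\mathcal{F},\varepsilon)} \left(\frac{4ea}{\varepsilon}\right)^k \binom{|X|}{k}^2.$$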


The  right-hand  side  can  be  estimated  as  in  the  proof  of  Lemma  7.12,  and  the 
proof  is  completed  by  applying  Theorem  7.50.  □ 
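
The counting step in this proof can be checked by brute force on small cases. The sketch below is illustrative only: the integer `M` plays the role of $\lfloor a|X|/\varepsilon \rfloor$, and the helper names `count_tuples` and `count_bound` are introduced here. It verifies that the number of tuples $(m_1,\ldots,m_k)$ of nonnegative integers with sum at most `M` equals $\binom{M+k}{k}$, and that $2^k\binom{M+k}{k} \le (4eM/k)^k$ whenever $M \ge k$ (which is the situation $\varepsilon \le a$, $k \le |X|$).

```python
from itertools import product
from math import comb, e

def count_tuples(k, M):
    """Brute-force count of (m_1,...,m_k) in Z_+^k with m_1 + ... + m_k <= M."""
    return sum(1 for m in product(range(M + 1), repeat=k) if sum(m) <= M)

def count_bound(k, M):
    """Upper bound (4 e M / k)^k for 2^k * C(M + k, k), valid when M >= k."""
    return (4 * e * M / k) ** k

checks = []
for k in range(1, 5):
    for M in range(k, 8):
        exact = count_tuples(k, M)
        # First entry: stars-and-bars identity; second entry: the estimate.
        checks.append((exact == comb(M + k, k), 2 ** k * exact <= count_bound(k, M)))

print(all(a and b for a, b in checks))
```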


The combinatorial part of the proof is now complete, and all that remains
is to apply a probabilistic extraction principle. It is not obvious how to do this,
however, as Lemma 7.17 uses uniform boundedness $\sup_f \|f\|_\infty \le 1$ in a
fundamental manner. To see why, note that in order for the extraction principle
to yield a nontrivial bound in conjunction with Corollary 7.52, the number of
samples $r$ in the extraction principle can be at most (poly)logarithmic in the
size of the packing. In Lemma 7.17, the uniform boundedness assumption
ensures that the random norm $\|f_i - f_j\|_{L^p(\mu_r)}$ is a subgaussian random variable,
so that a logarithmic number of samples suffices by a simple union bound. If
we only have control of the form $\sup_f \|f\|_q \le 1$ for some $q < \infty$, however,
the best we can hope for is a polynomial tail probability for $\|f_i - f_j\|_{L^p(\mu_r)}^p$,
and thus a simple union bound gives a polynomial rather than logarithmic
number of samples. This does not suffice to conclude the proof.
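
To make the contrast between the two regimes concrete, the following sketch compares the number of samples needed for a union bound over roughly $m^2$ pairs under a subgaussian (Hoeffding-type) tail and under a polynomial (Chebyshev-type) tail. The function names, the deviation level `s`, the variance proxy and the failure probability are all hypothetical parameters chosen for illustration; constants are not meant to match those in the text.

```python
import math

def samples_subgaussian(m, s, fail=0.25):
    """Smallest r with m^2 * exp(-2 r s^2) <= fail (Hoeffding tail plus union bound)."""
    return math.ceil(math.log(m * m / fail) / (2 * s * s))

def samples_polynomial(m, s, variance=1.0, fail=0.25):
    """Smallest r with m^2 * variance / (r s^2) <= fail (Chebyshev tail plus union bound)."""
    return math.ceil(m * m * variance / (fail * s * s))

for m in (10, 100, 1000, 10000):
    # The first count grows like log m, the second like m^2.
    print(m, samples_subgaussian(m, 0.1), samples_polynomial(m, 0.1))
```

Squaring the size of the packing roughly doubles the first count, while it multiplies the second by the square of the packing size.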

We must therefore develop a more sophisticated extraction principle. The
key idea that makes this possible is that, instead of working directly with the
$L^p$ norms $\|f_i - f_j\|_{L^p(\mu)}$, we should focus attention on the tail probabilities
$\mu(|f_i - f_j| > t)$. The following simple lemma shows how this can be done.

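Lemma 7.53. Let $g$ be a measurable function on the measure space $(X, \mu)$. If
$\|g\|_{L^p(\mu)} > \varepsilon$, then for any $\alpha > 1$ there exists $t > 0$ so that

$$t^{\alpha p} \mu(|g| > t) > \left(1 - \frac{1}{\alpha}\right)^{\alpha} \varepsilon^{\alpha p}.$$

Conversely, if $\|g\|_{L^p(\mu)} \le \varepsilon$, then $t^p \mu(|g| > t) \le \varepsilon^p$ for all $t > 0$.

Proof. Suppose that $\mu(|h| > t) \le t^{-\alpha p}$ for all $t > 0$. Then we can estimate

$$\|h\|_{L^p(\mu)}^p \le 1 + \int_1^\infty p\,t^{p-1} \mu(|h| > t)\, dt \le \frac{\alpha}{\alpha - 1}.$$

Inserting $h = (\frac{\alpha}{\alpha-1})^{1/p}\, g/\varepsilon$ readily yields the contrapositive of the first
assertion. The second assertion is immediate from Chebyshev's inequality. □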
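
Both directions of this lemma can be checked numerically for a function on a finite set with the uniform probability measure. The sketch below is illustrative: the helper names `lp_norm` and `weak_sup`, the test vector `g`, and the choice $p = 2$, $\alpha = 3/2$ are assumptions made here, and the constant $(1 - 1/\alpha)^\alpha$ is the one that the computation in the proof produces. Letting $\varepsilon$ increase to $\|g\|_{L^p(\mu)}$, the first assertion gives $\sup_t t^{\alpha p}\mu(|g|>t) \ge (1-1/\alpha)^\alpha \|g\|_{L^p(\mu)}^{\alpha p}$, which is what is tested.

```python
def lp_norm(values, p):
    """L^p norm of a function on a finite set under the uniform probability measure."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)

def weak_sup(values, q):
    """sup over t > 0 of t^q * mu(|g| > t) for the uniform probability measure.

    The supremum is approached as t increases to one of the values |g(x)|,
    where mu(|g| > t) tends to mu(|g| >= that value).
    """
    n = len(values)
    mags = sorted(abs(v) for v in values)
    # At least n - i entries are >= mags[i]; the maximum over i is exact even with ties.
    return max(mags[i] ** q * (n - i) / n for i in range(n))

g = [0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 3.0, 7.0]
p, alpha = 2.0, 1.5
norm = lp_norm(g, p)

lower = (1 - 1 / alpha) ** alpha * norm ** (alpha * p)
first_ok = weak_sup(g, alpha * p) >= lower   # first assertion, with exponent alpha * p
second_ok = weak_sup(g, p) <= norm ** p      # converse (Chebyshev), with exponent p
print(first_ok, second_ok)
```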

The key advantage of working with tail probabilities rather than $L^p$ norms
is that the empirical measure $\mu_r(|f_i - f_j| > t)$ is subgaussian, and we can
therefore use a simple union bound to control the empirical tail probabilities
using a number of samples that is only logarithmic in the size of the packing.
On the other hand, Lemma 7.53 shows that separation in $L^p$ yields a tail
bound of order $t^{-p'}$ only if we are willing to lose slightly in the exponent
$p' > p$. This explains why it is essential for dimension-free control of
$L^p$-covering numbers that the class is $L^{p'}$-bounded for $p' > p$. Once this idea has
been understood, it is not difficult to work out the details.

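Proposition 7.54 (Weak extraction). Let $p \ge 1$, $a \ge \varepsilon > 0$, $m \ge 2$, and
let $\mu$ be a probability measure on $X$. If $f_1,\ldots,f_m$ are functions on $X$ such that

$$\|f_i\|_{L^{2p}(\mu)} \le a, \qquad \|f_i - f_j\|_{L^p(\mu)} > \varepsilon \qquad \text{for all } 1 \le i < j \le m,$$

then there exist $r \le C(2a/\varepsilon)^{12p} \log m$ points $x_1,\ldots,x_r \in X$ and a subset
$J \subseteq \{1,\ldots,m\}$ of cardinality $|J| \ge m/2$ such that

$$\|f_i\|_{L^{2p}(\mu_x)} \le 2a, \qquad \|f_i - f_j\|_{L^{3p/2}(\mu_x)} > \varepsilon/9 \qquad \text{for all } i,j \in J,\ i \ne j,$$

where $\mu_x := \frac{1}{r} \sum_{k=1}^r \delta_{x_k}$ and $C$ is a universal constant.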

Proof. Let $X_1,\ldots,X_r \sim \mu$ be i.i.d., and denote by $\mu_r$ their empirical measure.
We begin by controlling the $L^{2p}(\mu_r)$-norm of the functions $f_i$. Note that

$$\mathbf{P}[\|f_i\|_{L^{2p}(\mu_r)} > 2a] \le \frac{\|f_i\|_{L^{2p}(\mu)}^{2p}}{(2a)^{2p}} \le \frac{1}{4}$$

by Chebyshev's inequality. We therefore have

$$\mathbf{E}|\{i : \|f_i\|_{L^{2p}(\mu_r)} \le 2a\}| = \sum_{i=1}^m \mathbf{P}[\|f_i\|_{L^{2p}(\mu_r)} \le 2a] \ge \frac{3m}{4}.$$

Using $\mathbf{E}[Z] \le u + \|Z\|_\infty \mathbf{P}[Z > u]$, we can estimate

$$\mathbf{P}\left[|\{i : \|f_i\|_{L^{2p}(\mu_r)} \le 2a\}| > \frac{m}{2}\right] \ge \frac{1}{4}.$$

Thus with probability more than one quarter, at least half of the functions $f_i$
remain bounded as $\|f_i\|_{L^{2p}(\mu_r)} \le 2a$ under the empirical measure.

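We now turn to controlling the separation between the functions $f_i$.
Applying Lemma 7.53 with $\alpha = 3/2$, we choose $t_{ij} > 0$ for every $i < j$ so that

$$3^{-3/2}\varepsilon^{3p/2} \le t_{ij}^{3p/2}\, \mu(|f_i - f_j| > t_{ij}) \le t_{ij}^{-p/2} (2a)^{2p},$$

where the second inequality is Chebyshev's inequality together with
$\|f_i - f_j\|_{L^{2p}(\mu)} \le 2a$. Rearranging yields $(\varepsilon/t_{ij})^{3p/2} \ge 3^{-9/2}(\varepsilon/2a)^{6p}$. We can therefore estimate
using the Azuma-Hoeffding inequality

$$\mathbf{P}\left[t_{ij}^{3p/2} \mu_r(|f_i - f_j| > t_{ij}) \le 3^{-2}\varepsilon^{3p/2}\right] \le \mathbf{P}\left[t_{ij}^{3p/2} \mu_r(|f_i - f_j| > t_{ij}) \le t_{ij}^{3p/2} \mu(|f_i - f_j| > t_{ij}) - 3^{-3}\varepsilon^{3p/2}\right] \le e^{-r 3^{-15} (\varepsilon/2a)^{12p}}.$$

A union bound now gives

$$\mathbf{P}\left[t_{ij}^{3p/2} \mu_r(|f_i - f_j| > t_{ij}) > 3^{-2}\varepsilon^{3p/2}\ \ \forall\, i < j\right] \ge 1 - m^2 e^{-r 3^{-15} (\varepsilon/2a)^{12p}} \ge \frac{3}{4}$$

for $r \gtrsim (2a/\varepsilon)^{12p} \log m$. In particular, Lemma 7.53 implies that

$$\mathbf{P}\left[\|f_i - f_j\|_{L^{3p/2}(\mu_r)} > \varepsilon/9 \text{ for all } i < j\right] \ge \frac{3}{4}$$

for $r \gtrsim (2a/\varepsilon)^{12p} \log m$. Thus with probability more than three quarters, all
functions $f_i$ are separated by $\varepsilon/9$ in $L^{3p/2}(\mu_r)$ under the empirical measure.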

Now note that the sum of the probabilities of the events on which
boundedness and separation hold under the empirical measure exceeds one if we
choose $r = \lfloor C(2a/\varepsilon)^{12p} \log m \rfloor$ for a sufficiently large universal constant $C$.
Thus these events cannot be disjoint, and we can select a sample $x_1,\ldots,x_r$
in their intersection. The conclusion of the proof follows readily. □

We now have all the ingredients to complete the proof of Theorem 7.45.

Proof (Theorem 7.45). Let $f_1,\ldots,f_m \in \mathcal{F}$ be an $\varepsilon$-packing of $(\mathcal{F}, \|\cdot\|_{L^p(\mu)})$
of cardinality $m \ge N(\mathcal{F}, \|\cdot\|_{L^p(\mu)}, \varepsilon)$. By Proposition 7.54 there exist
$r \le C(2a/\varepsilon)^{12p} \log m$ points $x_1,\ldots,x_r \in X$ and $f_1,\ldots,f_l \in \mathcal{F}$ with $l \ge m/2$ such
that $\|f_i\|_{L^{2p}(\mu_x)} \le 2a$ and $\|f_i - f_j\|_{L^{3p/2}(\mu_x)} > \varepsilon/9$ for all $1 \le i < j \le l$, where
$\mu_x$ is the uniform distribution on $x_1,\ldots,x_r$. By Corollary 7.52, we have

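$$\frac{m}{2} \le l \le \left(\frac{Ka}{\varepsilon}\right)^{39p^2\, \mathrm{vc}(\mathcal{F}, \varepsilon/9c)} \left(\frac{\log m}{6p\, \mathrm{vc}(\mathcal{F}, \varepsilon/9c)}\right)^{3p\, \mathrm{vc}(\mathcal{F}, \varepsilon/9c)}$$

for a universal constant $K$. Using $\alpha \log m \le m^\alpha$ and rearranging, this yields

$$N(\mathcal{F}, \|\cdot\|_{L^p(\mu)}, \varepsilon) \le m \le \left(\frac{2Ka}{\varepsilon}\right)^{78p^2\, \mathrm{vc}(\mathcal{F}, \varepsilon/9c)}.$$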

This  completes  the  proof.  □ 

Problems 

7.15 (Improved uniform covering bounds). In order to keep the notation
minimal, we made some arbitrary choices in the statement and proof of
Theorem 7.45. By carefully keeping track of the constants in the proof, extend
Theorem 7.45 to bound $N(\mathcal{F}, \|\cdot\|_{L^p(\mu)}, \varepsilon)$ under the assumption $\sup_f \|f\|_{L^{\beta p}(\mu)} \le a$
for any $p \ge 1$ and $\beta > 1$ as indicated in Remark 7.46.

7.16 ($L^\infty$-covering numbers and combinatorial dimension). Throughout
this chapter, we have obtained dimension-free estimates for $L^p$-covering
numbers with $p < \infty$. One cannot expect to obtain dimension-free
$L^\infty$-covering numbers, however. For example, when $\mathcal{F}$ is a class of indicator
functions on a finite set $X$, then $N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) = |\mathcal{F}|$ for all $0 < \varepsilon < 1$ and thus any
nontrivial $L^\infty$-covering number bound must depend on $|X|$. While this
dependence is in general exponential, the Sauer-Shelah Lemma 7.12 states that the
$L^\infty$-covering numbers grow only polynomially in $|X|$ for VC-classes of sets. It
is natural to ask whether this is also true for general function classes.

a. Let $X$ be a finite set and let $\mu$ be the uniform distribution on $X$. Show that
$e^{-1}\|f\|_\infty \le \|f\|_{L^{\log|X|}(\mu)} \le \|f\|_\infty$ for every function $f$ on $X$.

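b. Deduce from Corollary 7.52 that if $\mathcal{F}$ is a class of functions on a finite set
$X$ such that $\|f\|_\infty \le 1$ for all $f \in \mathcal{F}$, then for universal constants $c, C$

$$\log N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le 2\, \mathrm{vc}(\mathcal{F}, c\varepsilon) \log|X| \log\left(\frac{C|X|}{\varepsilon\, \mathrm{vc}(\mathcal{F}, c\varepsilon)}\right).$$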

For classes of sets $\mathcal{C}$ the Sauer-Shelah lemma implies $\log N(\mathcal{C}, \|\cdot\|_\infty, \varepsilon) \lesssim \log|X|$,
while we have obtained above a bound of order $\log^2|X|$ for arbitrary
function classes $\mathcal{F}$. It is not known whether a polynomial bound is possible
in the general setting. However, we can achieve nearly polynomial scaling by
improving the above bound to $\log^{1+\delta}|X|$ for any $\delta > 0$.

c. The small deviation result of Lemma 7.51 is not the most efficient. Show
that the conclusion can be improved to $\mathbf{P}[X \le b]^{1/p^\delta} + \mathbf{P}[X \ge b + \varepsilon]^{1/p^\delta} \ge 1$
for any $\delta > 0$, with the constant $C$ depending on $\delta$ but not on $p$.

d. Prove a general bound of order $\log N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \lesssim \log^{1+\delta}|X|$.

e. Similarly, the scaling $\propto p^2$ of the bound of Theorem 7.45 is not the best
possible. Show that the scaling can be improved to $\propto p^{1+\delta}$ for any $\delta > 0$.
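
For part (a) of Problem 7.16, a quick numerical experiment can build intuition before attempting a proof. The sketch below is only a sanity check on two hand-picked examples: the helper names `lp_norm_uniform` and `check_sandwich` and the test vectors are introduced here, and a small relative tolerance absorbs floating-point rounding. The "spike" example is the extremal case in which the lower bound $e^{-1}\|f\|_\infty$ is attained.

```python
import math

def lp_norm_uniform(values, p):
    """L^p norm of a function on a finite set under the uniform distribution."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)

def check_sandwich(values):
    """Check e^{-1} ||f||_inf <= ||f||_{L^{log|X|}} <= ||f||_inf up to rounding."""
    p = math.log(len(values))
    sup = max(abs(v) for v in values)
    mid = lp_norm_uniform(values, p)
    return sup / math.e <= mid * (1 + 1e-12) and mid <= sup * (1 + 1e-12)

spike = [1.0] + [0.0] * 999   # a single large value: norm equals |X|^(-1/log|X|) = 1/e
flat = [1.0] * 1000           # constant function: all norms coincide

print(check_sandwich(spike), check_sandwich(flat))
```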

7.17 (Iteration and Sudakov's inequality). We have systematically
developed upper and lower bounds for the suprema of random processes in terms
of covering numbers. An implicit motivation for these results is that it is
often easier to bound the covering numbers of a set $T$ than to bound directly a
random process defined on $T$. However, these results prove to be useful also
in the converse direction: there are situations where a direct estimate on the
supremum of a random process can be used to obtain nontrivial bounds for
covering numbers that are otherwise hard to compute.

The simplest result that can be used for this purpose is Sudakov's
inequality. Let $T \subseteq B(0,1)$ be a subset of the Euclidean unit ball in $\mathbb{R}^n$, and

$$X_t := \sum_{i=1}^n g_i t_i, \qquad \omega(\delta) := \sup_{s \in T} \mathbf{E}\Big[\sup_{t \in T \cap B(s,\delta)} X_t\Big],$$

where $g_1,\ldots,g_n$ are i.i.d. $N(0,1)$. Note that $\{X_t\}_{t\in T}$ is a Gaussian process
whose natural distance $d$ is the Euclidean distance. We can therefore estimate

$$\log N(T, d, \varepsilon) \lesssim \frac{\omega(1)^2}{\varepsilon^2}.$$

How good is this bound? Unfortunately, it leaves something to be desired.

a. Let $T = B(0,1)$ be the Euclidean unit ball. Show that Sudakov's inequality
yields at best $\log N(T, d, \varepsilon) \lesssim n/\varepsilon^2$. On the other hand, show that in fact
$\log N(T, d, \varepsilon) \asymp n \log(1/\varepsilon)$, which is far better than is predicted by Sudakov.

It is not too surprising that Sudakov's inequality fails to capture the
correct behavior of the covering numbers even in the simplest possible example:
$\omega(1) < \infty$ can hold even for infinite-dimensional classes, and thus we cannot
predict correctly the behavior of the covering numbers on the basis of this
quantity only. On the other hand, the local modulus of continuity $\omega(\delta)$
contains much more information. It can be exploited using an iteration argument.

b. Show that for any $\varepsilon > 0$

$$\log N(T, d, \varepsilon) \lesssim \sum_{k=0}^{\infty} 1_{2^k\varepsilon \le 1}\, \frac{\omega(2^{k+1}\varepsilon)^2}{(2^k\varepsilon)^2} \lesssim \int_\varepsilon^2 \frac{\omega(2x)^2}{x^3}\, dx.$$

c. Show that if $T = B(0,1)$ is the Euclidean unit ball, then $\omega(x) \le x\sqrt{n}$ and
thus iteration yields a covering number estimate of the correct order.
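
The gain from iteration in this example can be seen numerically. The sketch below is a rough illustration with all universal constants dropped: it evaluates the dyadic sum over scales $2^k\varepsilon \le 1$ with the modulus $\omega(x) = x\sqrt{n}$ taken as an assumption (this is the bound from part (c), used here with equality), and compares it with the one-shot estimate $n/\varepsilon^2$. The function names are introduced here for illustration.

```python
import math

def sudakov_bound(n, eps):
    """One-shot Sudakov-type estimate n / eps^2 (universal constants dropped)."""
    return n / eps ** 2

def iterated_bound(n, eps):
    """Sum over k with 2^k eps <= 1 of omega(2^(k+1) eps)^2 / (2^k eps)^2,
    evaluated with the assumed modulus omega(x) = x * sqrt(n)."""
    total, k = 0.0, 0
    while 2 ** k * eps <= 1:
        omega = 2 ** (k + 1) * eps * math.sqrt(n)
        total += omega ** 2 / (2 ** k * eps) ** 2   # each scale contributes 4n
        k += 1
    return total

for eps in (0.5, 0.1, 0.01, 0.001):
    print(eps, sudakov_bound(100, eps), iterated_bound(100, eps))
```

Each dyadic scale contributes the same amount $4n$, and there are about $\log_2(1/\varepsilon)$ scales, which is how the iterated estimate comes out of order $n\log(1/\varepsilon)$ rather than $n/\varepsilon^2$.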


Notes 

§7.1. The symmetrization method, which has its origin in probability in
Banach spaces, has been a fundamental part of empirical process theory following
the influential work of Giné and Zinn [40]. A slightly different form of
symmetrization was already used by Vapnik and Chervonenkis [92]. Lemma 7.6 is
due to Panchenko [64]. The characterization of Bernoulli processes mentioned
in Problem 7.1 was proved by Bednorz and Latała [9] (see also [89] for an
exposition). The simple contraction method used in Problem 7.1 is classical
[51], while the “inverse” Gaussian symmetrization method is based on [67].
Problem 7.2 is based on [40] (the result developed here dates back to [93]). See
also [82] for more precise characterizations of the Glivenko-Cantelli property.
Much more on self-normalized processes (Problem 7.3) can be found in [65].
The contraction principle of Problem 7.4 can be found in [51].

§7.2. The notion of VC-dimension and its application to the Glivenko-Cantelli
problem were developed by Vapnik and Chervonenkis [92]. The Sauer-Shelah
lemma was proved by Sauer in answer to a question posed by Erdős [72];
an infinite version of it appeared in work on mathematical logic by Shelah.
Theorem 7.16 is due to Dudley [29]. Uniform Glivenko-Cantelli classes were
studied systematically by Dudley, Giné and Zinn [32] and Alon et al. [4].
Pajor’s  formulation  of  the  Sauer-Shelah  lemma  is  from  [63].  The  somewhat 
pedantic  proof  we  have  given  here  (based  on  [60])  is  intended  to  prepare  the 
reader  for  the  next  section.  Classical  proofs  are  developed  in  Problems  7.7 
and  7.8.  The  formulation  of  the  Glivenko-Cantelli  theorem  in  Problem  7.10  is 
due  to  Steele  [74];  the  example  of  convex  sets  follows  the  treatment  in  [68]. 
Problem  7.11  gives  a  very  brief  introduction  to  the  topic  of  uniform  central 
limit  theorems  that  has  historically  motivated  many  developments  in  empirical 
process  theory;  textbook  treatments  can  be  found  in  [30,  91]. 

§7.3. The notion of combinatorial dimension has its origin in Banach space
theory. It was used implicitly by Elton [34] following the development of an
infinite counterpart of this idea by Rosenthal [69] to characterize Banach
spaces that embed $\ell_1$ (see [45] for the probabilistic significance of the latter
notion). A first result along the lines of Theorem 7.30, but with much worse
scaling, is due to Pajor [63]. Theorem 7.30, due to Mendelson and Vershynin
[60], is essentially the best possible. The much simpler notion of VC-subgraph
classes (Problem 7.12) appeared independently, cf. [68]. Problem 7.13 is taken
from [61], while the approach of Problem 7.14 follows [60].

§7.4. The lower bound in Lemma 7.40 is from [87]. The iteration method
is often used in Banach space theory; see, for example, [5] for an interesting
application. Example 7.44 is inspired by the example given in [6, Lemma 4.9].
Theorem 7.45 and its use as an iteration principle are due to Rudelson and
Vershynin [70], and we follow a simplified version of their proof. $L^\infty$-covering
bounds in terms of combinatorial dimension (Problem 7.16) were first obtained
in [4] with a worse scaling. Problem 7.17 is inspired by [15].